Preface


Read This First 


About This Manual 


This user’s guide serves as an applications reference book for the 
TMS320C40 and TMS320C44 digital signal processors (DSP). Throughout 
the book, all references to the TMS320C4x apply to both devices (exceptions 
are noted). 


Specifically, this book complements the TMS320C4x User’s Guide by
providing information to assist managers and hardware/software engineers in
application development. It includes example code and hardware connections for
various applications.


The guide shows how to use the instruction set, the architecture, and the ’C4x 
interface. It presents examples for frequently used applications and discusses 
more involved examples and applications. It also defines the principles in- 
volved in many applications and gives the corresponding assembly language 
code for instructional purposes and for immediate use. Whenever the detailed 
explanation of the underlying theory is too extensive to be included in this man- 
ual, appropriate references are given for further information. 


How to Use This Book 


The following table summarizes the ’C4x information contained in this user’s
guide:

If you are looking for
information about:       Turn to these chapters:

Arithmetic               Chapter 3, Logical and Arithmetic Operations
Communication Ports      Chapter 8, Using the Communication Ports
Companding               Chapter 6, Applications-Oriented Operations
Development Support      Chapter 10, Development Support and Part Order Information
DMA Coprocessor          Chapter 7, Programming the DMA Coprocessor
FFTs                     Chapter 6, Applications-Oriented Operations
Filters                  Chapter 6, Applications-Oriented Operations
Ordering Parts           Chapter 10, Development Support and Part Order Information
Repeat Modes             Chapter 2, Program Control
Reset                    Chapter 1, Processor Initialization
Stacks                   Chapter 2, Program Control
Tips                     Chapter 5, Programming Tips
Wait States              Chapter 4, Memory Interfacing
XDS510 Emulator          Chapter 11, XDS510 Emulator Design Considerations

Style and Symbol Conventions 
This document uses the following conventions: 


- Program listings, program examples, file names, and symbol names are
  shown in a special font. Examples use a bold version of the special font
  for emphasis. Here is a sample program listing segment:

  *
  LOOP1   RPTB    MAX
          CMPF    *AR0,R0         ;Compare number to the maximum
  MAX     LDFLT   *AR0,R0         ;If greater, this is a new max
          B       NEXT
  LOOP2   RPTB    MIN
          CMPF    *AR0++(1),R0    ;Compare number to the minimum
  MIN     LDFLT   *-AR0(1),R0     ;If smaller, this is new minimum

- Throughout this book MSB and LSB indicate most significant bit and least
  significant bit, respectively. MS byte and LS byte indicate most significant
  byte and least significant byte, respectively.


Information About Cautions and Warnings 


This book may contain cautions and warnings. 


This is an example of a caution statement. 


A caution statement describes a situation that could potentially 
damage your software or equipment. 


This is an example of a warning statement. 


A warning statement describes a situation that could potentially 
cause harm to you. 


The information in a caution or a warning is provided for your protection. 
Please read each caution and warning carefully. 


Related Documentation From Texas Instruments


The following books describe the TMS320 floating-point devices and related
support tools. To obtain a copy of any of these TI documents, call the Texas
Instruments Literature Response Center at (800) 477-8924. When ordering,
please identify the book by its title and literature number.


TMS320C4x User’s Guide (literature number SPRU063) describes the 'C4x 
32-bit floating-point processor, developed for digital signal processing as 
well as parallel processing applications. Covered are its architecture, in- 
ternal register structure, instruction set, pipeline, specifications, and op- 
eration of its six DMA channels and six communication ports. 


TMS320C4x Parallel Processing Development System Technical Reference
(literature number SPRU075) describes the TMS320C4x parallel
processing system, a system with four ’C4xs with shared and distributed
memory.


Parallel Processing with the TMS320C4x (literature number SPRA031) de- 
scribes parallel processing and how the ’C40 can be used in parallel pro- 
cessing. Also provides sample parallel processing applications. 


TMS320 Floating-Point DSP Assembly Language Tools User’s Guide (lit- 
erature number SPRU035) describes the assembly language tools (as- 
sembler, linker, and other tools used to develop assembly language 
code), assembler directives, macros, common object file format, and 
symbolic debugging directives for the ’C3x and ’C4x generations of de- 
vices. 


TMS320 Floating-Point DSP Optimizing C Compiler User’s Guide (litera- 
ture number SPRU034) describes the TMS320 floating-point C compiler. 
This C compiler accepts ANSI standard C source code and produces 
TMS320 assembly language source code for the ’C3x and ’C4x genera- 
tions of devices. 


TMS320C4x C Source Debugger User’s Guide (literature number
SPRU054) tells you how to invoke the ’C4x emulator and simulator
versions of the C source debugger interface. This book discusses various
aspects of the debugger interface, including window management, com- 
mand entry, code execution, data management, and breakpoints. It also 
includes a tutorial that introduces basic debugger functionality. 


TMS320C4x Technical Brief (literature number SPRU076) gives a
condensed overview of the ’C4x DSP and its development tools. It also lists
TMS320C4x third parties.


TMS320 Family Development Support Reference Guide (literature number
SPRU011) describes the ’320 family of digital signal processors and the
various products that support it. This includes code-generation tools 
(compilers, assemblers, linkers, etc.) and system integration and debug 
tools (simulators, emulators, evaluation modules, etc.). This book also 
lists related documentation, outlines seminars and the university pro- 
gram, and gives factory repair and exchange information. 


TMS320 Third-Party Support Reference Guide (literature number
SPRU052) alphabetically lists over 100 third parties that supply various
products that serve the family of 320 digital signal processors—software 
and hardware development tools, speech recognition, image process- 
ing, noise cancellation, modems, etc. 


TMS320 DSP Designer’s Notebook: Volume 1 (literature number SPRT125)
presents solutions to common design problems using ’C2x, ’C3x, ’C4x, ’C5x,
and other TI DSPs.


Related Articles and Books 


A wide variety of related documentation is available on digital signal process- 
ing. These references fall into one of the following application categories: 


- General-Purpose DSP
- Graphics/Imagery
- Speech/Voice
- Control
- Multimedia
- Military
- Telecommunications
- Automotive
- Consumer
- Medical
- Development Support


In the following list, references appear in alphabetical order according to au- 
thor. The documents contain beneficial information regarding designs, opera- 
tions, and applications for signal-processing systems; all of the documents 
provide additional references. Texas Instruments strongly suggests that you 
refer to these publications both before and during the design process. 


General-Purpose DSP: 


1) Antoniou, A., Digital Filters: Analysis and Design, New York, NY: 
McGraw-Hill Company, Inc., 1979. 


2) Brigham, E.O., The Fast Fourier Transform, Englewood Cliffs, NJ: Pren- 
tice-Hall, Inc., 1974. 

3) Burrus, C.S., and T.W. Parks, DFT/FFT and Convolution Algorithms, New 
York, NY: John Wiley and Sons, Inc., 1984. 


4) Chassaing, R., Horning, D.W., “Digital Signal Processing with Fixed and
Floating-Point Processors.” COED, USA, Volume 1, Number 1, pages 1-4,
March 1991.


5) Defatta, David J., Joseph G. Lucas, and William S. Hodgkiss, Digital Sig- 
nal Processing: A System Design Approach, New York: John Wiley, 1988. 


6) Erskine, C., and S. Magar, “Architecture and Applications of a Second- 
Generation Digital Signal Processor.” Proceedings of IEEE International 
Conference on Acoustics, Speech, and Signal Processing, USA, 1985. 


7) Essig, D., C. Erskine, E. Caudel, and S. Magar, “A Second-Generation
Digital Signal Processor.” IEEE Journal of Solid-State Circuits, USA,
Volume SC-21, Number 1, pages 86-91, February 1986.


8) Frantz, G., K. Lin, J. Reimer, and J. Bradley, “The Texas Instruments
TMS320C25 Digital Signal Microcomputer.” IEEE Microelectronics, USA,
Volume 6, Number 6, pages 10-28, December 1986.


9) Gass, W., R. Tarrant, T. Richard, B. Pawate, M. Gammel, P. Rajasekaran, 
R. Wiggins, and C. Covington, “Multiple Digital Signal Processor Environ- 
ment for Intelligent Signal Processing.” Proceedings of the IEEE, USA, 
Volume 75, Number 9, pages 1246-1259, September 1987. 


10) Gold, Bernard, and C.M. Rader, Digital Processing of Signals, New York, 
NY: McGraw-Hill Company, Inc., 1969. 


11) Hamming, R.W., Digital Filters, Englewood Cliffs, NJ: Prentice-Hall, Inc., 
1977. 


12) IEEE ASSP DSP Committee (Editor), Programs for Digital Signal Pro- 
cessing, New York, NY: IEEE Press, 1979. 


13) Jackson, Leland B., Digital Filters and Signal Processing, Hingham, MA: 
Kluwer Academic Publishers, 1986. 


14) Jones, D.L., and T.W. Parks, A Digital Signal Processing Laboratory Using 
the TMS32010, Englewood Cliffs, NJ: Prentice-Hall, Inc., 1987. 


15) Lim, Jae, and Alan V. Oppenheim, Advanced Topics in Signal Processing,
Englewood Cliffs, NJ: Prentice-Hall, Inc., 1988.


16) Lin, K., G. Frantz, and R. Simar, Jr., “The TMS320 Family of Digital Signal 
Processors.” Proceedings of the IEEE, USA, Volume 75, Number 9, pages 
1143-1159, September 1987. 


17) Lovrich, A., Reimer, J., “An Advanced Audio Signal Processor.” Digest of 
Technical Papers for 1991 International Conference on Consumer Elec- 
tronics, June 1991. 


18) Magar, S., D. Essig, E. Caudel, S. Marshall and R. Peters, “An NMOS Digi- 
tal Signal Processor with Multiprocessing Capability.” Digest of IEEE Inter- 
national Solid-State Circuits Conference, USA, February 1985. 


19) Morris, Robert L., Digital Signal Processing Software, Ottawa, Canada: 
Carleton University, 1983. 


20) Oppenheim, Alan V. (Editor), Applications of Digital Signal Processing, 
Englewood Cliffs, NJ: Prentice-Hall, Inc., 1978. 


21) Oppenheim, Alan V., and R.W. Schafer, Digital Signal Processing, Engle- 
wood Cliffs, NJ: Prentice-Hall, Inc., 1975 and 1988. 


22) Oppenheim, A.V., A.N. Willsky, and I.T. Young, Signals and Systems, En- 
glewood Cliffs, NJ: Prentice-Hall, Inc., 1983. 


23) Papamichalis, P.E., and C.S. Burrus, “Conversion of Digit-Reversed to Bit-
Reversed Order in FFT Algorithms.” Proceedings of ICASSP 89, USA,
pages 984-987, May 1989.


24) Papamichalis, P., and R. Simar, Jr., “The TMS320C30 Floating-Point Digi- 
tal Signal Processor.” IEEE Micro Magazine, USA, pages 13-29, Decem- 
ber 1988. 


25) Parks, T.W., and C.S. Burrus, Digital Filter Design, New York, NY: John 
Wiley and Sons, Inc., 1987. 


26) Peterson, C., Zervakis, M., Shehadeh, N., “Adaptive Filter Design and
Implementation Using the TMS320C25 Microprocessor.” Computers in
Education Journal, USA, Volume 3, Number 3, pages 12-16,
July-September 1993.


27) Prado, J., and R. Alcantara, “A Fast Square-Rooting Algorithm Using a 
Digital Signal Processor.” Proceedings of IEEE, USA, Volume 75, Number 
2, pages 262-264, February 1987. 


28) Rabiner, L.R. and B. Gold, Theory and Applications of Digital Signal Pro- 
cessing, Englewood Cliffs, NJ: Prentice-Hall, Inc., 1975. 


29) Simar, Jr., R., and A. Davis, “The Application of High-Level Languages to 
Single-Chip Digital Signal Processors.” Proceedings of ICASSP 88, USA, 
Volume D, page 1678, April 1988. 


30) Simar, Jr., R., T. Leigh, P. Koeppen, J. Leach, J. Potts, and D. Blalock, “A 
40 MFLOPS Digital Signal Processor: the First Supercomputer on a Chip.” 
Proceedings of ICASSP 87, USA, Catalog Number 87CH2396-0, Volume 
1, pages 535-538, April 1987. 


31) Simar, Jr., R., and J. Reimer, “The TMS320C25: a 100 ns CMOS VLSI
Digital Signal Processor.” 1986 Workshop on Applications of Signal
Processing to Audio and Acoustics, September 1986.


32) Texas Instruments, Digital Signal Processing Applications with the
TMS320 Family, 1986; Englewood Cliffs, NJ: Prentice-Hall, Inc., 1987.


33) Treichler, J.R., C.R. Johnson, Jr., and M.G. Larimore, A Practical Guide
to Adaptive Filter Design, New York, NY: John Wiley and Sons, Inc., 1987.


Graphics/Imagery:


1) Andrews, H.C., and B.R. Hunt, Digital Image Restoration, Englewood
Cliffs, NJ: Prentice-Hall, Inc., 1977.


2) Gonzales, Rafael C., and Paul Wintz, Digital Image Processing, Reading,
MA: Addison-Wesley Publishing Company, Inc., 1977.


3) Papamichalis, P.E., “FFT Implementation on the TMS320C30.”
Proceedings of ICASSP 88, USA, Volume D, page 1399, April 1988.


4) Pratt, William K., Digital Image Processing, New York, NY: John Wiley and
Sons, 1978.


5) Reimer, J., and A. Lovrich, “Graphics with the TMS32020.” WESCON/85
Conference Record, USA, 1985.


Speech/Voice:


1) DellaMorte, J., and P. Papamichalis, “Full-Duplex Real-Time Implementa-
tion of the FED-STD-1015 LPC-10e Standard V.52 on the TMS320C25.”
Proceedings of SPEECH TECH 89, pages 218-221, May 1989.


2) Frantz, G.A., and K.S. Lin, “A Low-Cost Speech System Using the
TMS320C17.” Proceedings of SPEECH TECH ’87, pages 25-29, April
1987.


3) Gray, A.H., and J.D. Markel, Linear Prediction of Speech, New York, NY:
Springer-Verlag, 1976.


4) Jayant, N.S., and Peter Noll, Digital Coding of Waveforms, Englewood
Cliffs, NJ: Prentice-Hall, Inc., 1984.


5) Papamichalis, Panos, Practical Approaches to Speech Coding, Engle-
wood Cliffs, NJ: Prentice-Hall, Inc., 1987.


6) Papamichalis, P., and D. Lively, “Implementation of the DOD Standard
LPC-10/52E on the TMS320C25.” Proceedings of SPEECH TECH ’87,
pages 201-204, April 1987.


7) Pawate, B.I., and G.R. Doddington, “Implementation of a Hidden Markov
Model-Based Layered Grammar Recognizer.” Proceedings of ICASSP
89, USA, pages 801-804, May 1989.


8) Rabiner, L.R., and R.W. Schafer, Digital Processing of Speech Signals,
Englewood Cliffs, NJ: Prentice-Hall, Inc., 1978.


9) Reimer, J.B. and K.S. Lin, “TMS320 Digital Signal Processors in Speech 
Applications.” Proceedings of SPEECH TECH ’88, April 1988. 


10) Reimer, J.B., M.L. McMahan, and W.W. Anderson, “Speech Recognition 
for a Low-Cost System Using a DSP.” Digest of Technical Papers for 1987 
International Conference on Consumer Electronics, June 1987. 


Control: 


1) Ahmed, I., “16-Bit DSP Microcontroller Fits Motion Control System Ap- 
plication.” PCIM, October 1988. 


2) Ahmed, I., “Implementation of Self Tuning Regulators with TMS320 Fami- 
ly of Digital Signal Processors.” MOTORCON ’88, pages 248-262, Sep- 
tember 1988. 


3) Ahmed, I., and S. Lindquist, “Digital Signal Processors: Simplifying High- 
Performance Control.” Machine Design, September 1987. 


4) Ahmed, I., and S. Meshkat, “Using DSPs in Control.” Control Engineering,
February 1988.


5) Allen, C. and P. Pillay, “TMS320 Design for Vector and Current Control of 
AC Motor Drives.” Electronics Letters, UK, Volume 28, Number 23, pages 
2188-2190, November 1992. 


6) Bose, B.K., and P.M. Szczesny, “A Microcomputer-Based Control and 
Simulation of an Advanced IPM Synchronous Machine Drive System for 
Electric Vehicle Propulsion.” Proceedings of IECON ’87, Volume 1, pages 
454-463, November 1987. 


7) Hanselman, H., “LQG-Control of a Highly Resonant Disc Drive Head Posi- 
tioning Actuator.” IEEE Transactions on Industrial Electronics, USA, Vol- 
ume 35, Number 1, pages 100-104, February 1988. 


8) Jacquot, R., Modern Digital Control Systems, New York, NY: Marcel Dek- 
ker, Inc., 1981. 


9) Katz, P., Digital Control Using Microprocessors, Englewood Cliffs, NJ: 
Prentice-Hall, Inc., 1981. 


10) Kuo, B.C., Digital Control Systems, New York, NY: Holt, Reinholt, and 
Winston, Inc., 1980. 


11) Lovrich, A., G. Troullinos, and R. Chirayil, “An All-Digital Automatic Gain 
Control.” Proceedings of ICASSP 88, USA, Volume D, page 1734, April 
1988. 


12) Matsui, N. and M. Shigyo, “Brushless DC Motor Control Without Position
and Speed Sensors.” IEEE Transactions on Industry Applications, USA,
Volume 28, Number 1, Part 1, pages 120-127, January-February 1992.


13) Meshkat, S., and I. Ahmed, “Using DSPs in AC Induction Motor Drives.”
Control Engineering, February 1988.


14) Panahi, I. and R. Restle, “DSPs Redefine Motion Control.” Motion Control
Magazine, December 1993.


15) Phillips, C., and H. Nagle, Digital Control System Analysis and Design, En- 
glewood Cliffs, NJ: Prentice-Hall, Inc., 1984. 


Multimedia: 


1) Reimer, J., “DSP-Based Multimedia Solutions Lead Way Enhancing Audio
Compression Performance.” Dr. Dobbs Journal, December 1993.


2) Reimer, J., G. Benbassat, and W. Bonneau Jr., “Application Processors: 
Making PC Multimedia Happen.” Silicon Valley PC Design Conference, 
July 1991. 


Military: 


1) Papamichalis, P., and J. Reimer, “Implementation of the Data Encryption 
Standard Using the TMS32010.” Digital Signal Processing Applications, 
1986. 


Telecommunications: 


1) Ahmed, I., and A. Lovrich, “Adaptive Line Enhancer Using the
TMS320C25.” Conference Records of Northcon/86, USA, 14/3/1-10,
September/October 1986.


2) Casale, S., R. Russo, and G. Bellina, “Optimal Architectural Solution Us- 
ing DSP Processors for the Implementation of an ADPCM Transcoder.” 
Proceedings of GLOBECOM ’89, pages 1267-1273, November 1989. 


3) Cole, C., A. Haoui, and P. Winship, “A High-Performance Digital Voice
Echo Canceller on a Single TMS32020.” Proceedings of ICASSP 86,
USA, Catalog Number 86CH2243-4, Volume 1, pages 429-432, April
1986.


4) Cole, C., A. Haoui, and P. Winship, “A High-Performance Digital Voice 
Echo Canceller on a Single TMS32020.” Proceedings of IEEE Internation- 
al Conference on Acoustics, Speech and Signal Processing, USA, 1986. 


5) Lovrich, A., and J. Reimer, “A Multi-Rate Transcoder.” Transactions on 
Consumer Electronics, USA, November 1989. 


6) Lovrich, A. and J. Reimer, “A Multi-Rate Transcoder.” Digest of Technical 
Papers for 1989 International Conference on Consumer Electronics, June 
7-9, 1989. 


7) Lu, H., D. Hedberg, and B. Fraenkel, “Implementation of High-Speed
Voiceband Data Modems Using the TMS320C25.” Proceedings of ICASSP
87, USA, Catalog Number 87CH2396-0, Volume 4, pages 1915-1918,
April 1987.


8) Mock, P., “Add DTMF Generation and Decoding to DSP-µP Designs.”
Electronic Design, USA, Volume 30, Number 6, pages 205-213, March
1985.


9) Reimer, J., M. McMahan, and M. Arjmand, “ADPCM on a TMS320 DSP
Chip.” Proceedings of SPEECH TECH 85, pages 246-249, April 1985.


10) Troullinos, G., and J. Bradley, “Split-Band Modem Implementation Using
the TMS32010 Digital Signal Processor.” Conference Records of
Electro/86 and Mini/Micro Northeast, USA, 14/1/1-21, May 1986.


Automotive:


1) Lin, K., “Trends of Digital Signal Processing in Automotive.” International
Congress on Transportation Electronic (CONVERGENCE ’88), October
1988.


Consumer:


1) Frantz, G.A., J.B. Reimer, and R.A. Wotiz, “Julie, The Application of DSP
to a Product.” Speech Tech Magazine, USA, September 1988.


2) Reimer, J.B., and G.A. Frantz, “Customization of a DSP Integrated Circuit
for a Customer Product.” Transactions on Consumer Electronics, USA,
August 1988.


3) Reimer, J.B., P.E. Nixon, E.B. Boles, and G.A. Frantz, “Audio Customiza-
tion of a DSP IC.” Digest of Technical Papers for 1988 International Con-
ference on Consumer Electronics, June 8-10 1988.


Medical:


1) Knapp and Townshend, “A Real-Time Digital Signal Processing System
for an Auditory Prosthesis.” Proceedings of ICASSP 88, USA, Volume A,
page 2493, April 1988.


2) Morris, L.R., and P.B. Barszczewski, “Design and Evolution of a Pocket-
Sized DSP Speech Processing System for a Cochlear Implant and Other
Hearing Prosthesis Applications.” Proceedings of ICASSP 88, USA,
Volume A, page 2516, April 1988.


Development Support:


1) Mersereau, R., R. Schafer, T. Barnwell, and D. Smith, “A Digital Filter
Design Package for PCs and TMS320.” MIDCON/84 Electronic Show and
Convention, USA, 1984.


2) Simar, Jr., R., and A. Davis, “The Application of High-Level Languages to
Single-Chip Digital Signal Processors.” Proceedings of ICASSP 88, USA,
Volume 3, pages 1678-1681, April 1988.


If you want to...                     Do this...

Request more information about       Write to:
Texas Instruments Digital Signal     Texas Instruments Incorporated
Processing (DSP) products            Market Communications Manager, MS 736
                                     P.O. Box 1443
                                     Houston, Texas 77251-1443

Order Texas Instruments              Call the TI Literature Response Center:
documentation                        (800) 477-8924

Ask questions about product          Contact the DSP hotline:
operation or report suspected        Phone: (713) 274-2320
problems                             FAX: (713) 274-2324
                                     Electronic Mail: 4389750@mcimail.com.

Obtain the source code in this       Call the TI BBS:
user’s guide                         (713) 274-2323
                                     Ftp from:
                                     ftp.ti.com
                                     log in as user ftp
                                     cd to /mirrors/tms320bbs

Visit TI online, including           Point your browser at:
TI&ME™, your own customized          http://www.ti.com
web page

Report mistakes or make comments     Send electronic mail to:
about this or any other TI           comments@books.sc.ti.com
documentation                        Send printed comments to:
                                     Texas Instruments Incorporated
                                     Technical Publications Mgr., MS 702
                                     P.O. Box 1443
                                     Houston, Texas 77251-1443


Trademarks 


Windows is a trademark of Microsoft Corporation. 

MS-DOS is a registered trademark of Microsoft Corporation. 
OS/2 is a trademark of International Business Machines Corp. 
Sun and SPARC are trademarks of Sun Microsystems, Inc. 
VAX and VMS are trademarks of Digital Equipment Corp. 


PAL® is a registered trademark of Advanced Micro Devices, Inc. 


Provides examples for initializing the processor


Chapter 1


Processor Initialization 


Before you execute a DSP algorithm, it is necessary to initialize the processor.
Initialization brings the processor to a known state. Generally, initialization
takes place any time after the processor is reset. This chapter reviews the
concepts explained in the user’s guide and provides examples.


1.1 Reset Process


After RESET is applied, the ’C4x jumps to the address stored in the reset vec- 
tor location and starts execution from that point. 


In order to reset the ’C4x correctly, you need to comply with several hardware
and software requirements:

- Select the reset vector location:

  The RESET vector of the ’C4x can be mapped to one of four different
  locations that are controlled by the value of the RESETLOC(1,0) pins
  at RESET. Table 1-1 shows possible reset vectors for the ’C40 and
  ’C44.

  If the DSP is in microcomputer mode (ROMEN pin = 1), RESETLOC(1,0)
  must be equal to 0,0 for the boot loader to operate correctly.

  If the DSP is in microcomputer mode, set the IIOFx pins as discussed in
  the bootloader chapter of the TMS320C4x User’s Guide so that the
  bootloader works properly.

- Provide the correct reset vector value:

  The RESET vector normally contains the address of the system
  initialization routine.

  In microcomputer mode, the reset vector is initialized automatically by
  the processor to point to the beginning of the on-chip boot loader
  code. No user action is required.

  In microprocessor mode, the reset vector is typically stored in an
  EPROM. Example 1-1 shows how you can initialize that vector.

- Apply a low level to the RESET input (see Section 1.2).


Table 1-1. RESET Vector Locations in the ’C40 and ’C44

Value at RESETLOCx Pin      Get Reset Vector From
RESETLOC1   RESETLOC0       Hex Memory Address    Bus
    0           0           00000 0000            Local
    0           1           07FFF FFFF†           Local
    1           0           08000 0000†           Global
    1           1           0FFFF FFFF†           Global

† This corresponds to the 32-bit address that the processor accesses. However, in the ’C44 only
the 24 LSBs of the reset address are driven on pins A0-A23 and pins LA0-LA23. The
corresponding LSTRBx pins are also activated.
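
The mapping in Table 1-1 can be sketched as a small lookup in C, which may be handy in a host-side configuration or test tool. This is only an illustration: the type and function names (reset_vector_t, reset_vector_location, c44_pin_address) are hypothetical and are not part of any TI library.

```c
#include <stdint.h>

/* Bus on which the reset vector is fetched (hypothetical names). */
typedef enum { BUS_LOCAL = 0, BUS_GLOBAL = 1 } reset_bus_t;

typedef struct {
    uint32_t    address;  /* 32-bit address that the processor accesses */
    reset_bus_t bus;      /* local or global bus */
} reset_vector_t;

/* Look up the reset vector location from the RESETLOC(1,0) pin values,
 * following the four rows of Table 1-1. */
reset_vector_t reset_vector_location(unsigned resetloc1, unsigned resetloc0)
{
    static const reset_vector_t table[4] = {
        { 0x00000000u, BUS_LOCAL  },  /* RESETLOC(1,0) = 0,0 */
        { 0x7FFFFFFFu, BUS_LOCAL  },  /* RESETLOC(1,0) = 0,1 */
        { 0x80000000u, BUS_GLOBAL },  /* RESETLOC(1,0) = 1,0 */
        { 0xFFFFFFFFu, BUS_GLOBAL },  /* RESETLOC(1,0) = 1,1 */
    };
    unsigned index = ((resetloc1 & 1u) << 1) | (resetloc0 & 1u);
    return table[index];
}

/* On the 'C44 only the 24 LSBs of the address appear on the address pins
 * (see the table footnote). */
uint32_t c44_pin_address(uint32_t address)
{
    return address & 0x00FFFFFFu;
}
```

For example, reset_vector_location(1, 0) yields address 0x80000000 on the global bus, which is the configuration assumed by Example 1-1.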


1.2 Reset Signal Generation 


Several aspects of ’C4x system hardware design are critical to overall system 
operation. One such aspect is reset signal generation. 


The reset input controls initialization of internal ’C4x logic and execution of the
system initialization software. For proper system initialization, the RESET
signal must be applied for at least ten H1 cycles, that is, 400 ns for a ’C4x
operating at 50 MHz. Upon power up, however, it can take 20 ms or more before the
system oscillator reaches a stable operating state. Therefore, the power-up
reset circuit should generate a low pulse on the RESET pin for 100 to 200 ms.
Once a proper reset pulse has been applied, the processor fetches the reset
vector from the location selected by the RESETLOC(1,0) pins (see Table 1-1),
which contains the address of the system initialization routine. Figure 1-1
shows a circuit that will generate an appropriate power-up or push-button
reset signal.
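
As a quick check of the ten-cycle requirement, the minimum RESET low time can be computed from the input clock frequency. The sketch below assumes that one H1 cycle lasts two CLKIN periods, which is what the 400 ns figure at 50 MHz implies; the function name min_reset_low_ns is hypothetical.

```c
/* Minimum RESET low time in nanoseconds for a given CLKIN frequency.
 * Assumption: one H1 cycle equals two CLKIN periods, consistent with
 * 10 H1 cycles = 400 ns at 50 MHz as stated in the text. */
double min_reset_low_ns(double clkin_mhz)
{
    const double h1_cycles_required = 10.0;          /* from the text */
    double h1_period_ns = 2.0 * 1000.0 / clkin_mhz;  /* 40 ns at 50 MHz */
    return h1_cycles_required * h1_period_ns;
}
```

Calling min_reset_low_ns(50.0) gives 400 ns. Note that this is several orders of magnitude shorter than the 100 to 200 ms power-up pulse, which is sized for oscillator stabilization rather than for the ten-cycle minimum.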


Figure 1-1. Reset Circuit

[Figure: RC network (R1 = 100 kΩ, C1 = 4.7 µF) supplied from +5 V and buffered by a 74ALS34, driving the RESET pin of the TMS320C4x.]


The voltage on the RESET pin is controlled by the R1C1 network. After a reset,
this voltage rises exponentially according to the time constant R1C1, as shown
in Figure 1-2. In Figure 1-1, the 74ALS34 provides a clean RESET signal to
the ’C4x.


Figure 1-2. Voltage on the RESET Pin

[Figure: voltage on the RESET pin versus time.]

The duration of the low pulse on the RESET pin is approximately t1, which is
the time it takes for the capacitor C1 to be charged to 1.5 V. This is
approximately the voltage at which the reset input switches from a logic 0 to a
logic 1. The capacitor voltage is expressed as

$$ V = V_{CC}\left(1 - e^{-t/\tau}\right) \qquad (5) $$

where $\tau = R_1 C_1$ is the reset circuit time constant. Solving (5) for t results in

$$ t = -R_1 C_1 \ln\left(1 - \frac{V}{V_{CC}}\right) \qquad (6) $$

Setting the following:

R1 = 100 kΩ
C1 = 4.7 µF
VCC = 5 V
V = V1 = 1.5 V

results in t = 167 ms. Therefore, the reset circuit of Figure 1-1 provides a low
pulse for a long enough time to ensure the stabilization of the system oscillator
upon powerup.
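
If you change the resistor or capacitor, equation (6) is easy to re-evaluate. The following C sketch is one way to do so; the function names are hypothetical, and the logarithm is computed with a short series so that the fragment has no dependency on the math library (on a host system you could equally call log() from math.h).

```c
/* Natural logarithm for x > 0, using ln(x) = 2*atanh((x-1)/(x+1))
 * expanded as a power series. Adequate for the ratios used here. */
static double ln_series(double x)
{
    double y = (x - 1.0) / (x + 1.0);
    double y2 = y * y;
    double term = y;
    double sum = 0.0;
    int k;

    for (k = 0; k < 40; k++) {
        sum += term / (2 * k + 1);
        term *= y2;
    }
    return 2.0 * sum;
}

/* Equation (6): time in seconds for the capacitor voltage to reach
 * v_threshold when charging toward vcc through r_ohms and c_farads. */
double reset_low_time_s(double r_ohms, double c_farads,
                        double vcc, double v_threshold)
{
    return -r_ohms * c_farads * ln_series(1.0 - v_threshold / vcc);
}
```

With the values above, reset_low_time_s(100e3, 4.7e-6, 5.0, 1.5) evaluates to about 0.1676 s, in agreement with the 167 ms quoted in the text.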


Note:

RESET does not have internal Schmitt hysteresis. To ensure proper reset
operation, avoid slow rise and fall times. Rise/fall time should not exceed one
CLKIN cycle.

1.3 Multiprocessing System Reset Considerations


If synchronization of multiple 'C4x DSPs is required, all processors should be 
provided with the same input clock and the same reset signal. After powerup, 
when the clock has stabilized, set RESET high for a few H1/H3 cycles and then 
set it low to synchronize their H1/H3 clock phases. Following the falling edge, 
RESET should remain low for at least ten H1 cycles and then be driven high. 
The circuit in Figure 1-1 can be used for RESET generation. 


Pullup resistors are recommended at each end of the connection to avoid
unintended triggering after reset when RESET going low is not received on all
’C4x devices at the same time.


It is recommended that you power up the system with RESET low. This
prevents ’C4x asynchronous signals from driving unknown values
before RESET goes low, which could create bus contention in
communication-port pins, resulting in damage to the device.


1.4 How to Initialize the Processor


After reset, the ’C4x jumps to the address stored in the reset vector location and
starts execution from that point. The RESET vector normally contains the
address of the system initialization routine.


The initialization routine should typically perform several tasks: 


- Set the DP register.
- Set the stack pointer.
- Set the interrupt vector table.
- Set the trap vector table.
- Set the memory control register.
- Clear/enable cache.


Note:

When running in microcomputer mode (ROMEN = 1), the address
stored in the reset vector location points to the beginning of the bootloader
code. The on-chip bootloader automatically initializes the memory-control
register values from the bootloader table.

The following examples illustrate how to initialize the ’C4x when using
assembly language and when using C.


Processor initialization under assembly language 


If you are running under an assembly-only environment, Example 1-1
provides a basic initialization routine. This example shows code for initializing the
’C4x to the following machine state:

- Timer 0 interrupt is enabled.
- Trap 0 is initialized.
- The program cache is enabled.
- The DP is initialized to point to the .text section.
- The stack pointer is initialized to the beginning of the mystack section.
- The memory control registers are initialized.
- The ’C4x is initialized to run in microprocessor mode with the reset vector
  located at address 08000 0000h (RESETLOC(1,0) = 1,0).
- The program has already been loaded into memory at address
  0x4000 0000.

You need to allocate the section addresses using a linker command file, as
shown in Example 1-2 (see the TMS320 Floating-Point DSP Assembly Language
Tools User’s Guide for more information about linker command files).

Example 1-1. Processor Initialization Example

;
; Create Reset Vector
;
          .sect   "rst_sect"      ; Named section for RESET vector
reset     .word   init            ; RESET vector
;
; Create Interrupt Vector Table
;
_myvect   .sect   "myvect"        ; Named section for int. vectors
          .space  2               ; Reserved space
          .word   tint0           ; Timer 0 ISR address
;
; Create Trap Vector Table
;
_mytrap   .sect   "mytrap"        ; Named section for trap vectors
          .word   trap0           ; Trap 0 subroutine address
;
; Create Stack
;
_mystack  .usect  "mystack",500   ; Reserve 500 locations for stack

          .text
stacka    .word   _mystack        ; Address of mystack section
ivta      .word   _myvect         ; Address of myvect section
tvta      .word   _mytrap         ; Address of mytrap section
ieval     .word   1               ; IE register value
gctrl     .word   ????????        ; Target board specific
lctrl     .word   ????????        ; Target board specific
mctrla    .word   100000h         ; Address of the global memory
                                  ; control register
init:
;
; Initialize the DP Register
;
          LDP     stacka
;
; Set Expansion Register IVTP
;
          LDI     @ivta,AR0
          LDPE    AR0,IVTP
;
; Set Expansion Register TVTP
;
          LDI     @tvta,AR0
          LDPE    AR0,TVTP
;
; Initialize global memory interface control
;
          LDI     @mctrla,AR0
          LDI     @gctrl,R0
          STI     R0,*AR0
;
; Initialize local memory interface control
;
          LDI     @lctrl,R0
          STI     R0,*+AR0(4)
;
; Initialize the Stack Pointer
;
          LDI     @stacka,SP
;
; Enable timer interrupt
; This is equivalent to ldi 1,iie
;
          LDI     @ieval,IIE
;
; Clear/Enable Cache and Enable Global Interrupts
;
          OR      3800H,ST
;
          BR      begin           ; Branch to the beginning of
                                  ; the application
begin
          < this is your application code>
trap0
          < this is your trap0 trap code>
          RETI
tint0
          < this is your tint0 interrupt service routine>
          RETI
          .end
Example 1-2. Linker Command File for Linking the Previous Example


MEMORY
{
    EPROM:  org = 0x80000000  len = 0x10     /* reset vector location */
    RAM:    org = 0x40000000  len = 0x100    /* extend RAM */
}

/* SPECIFY THE SECTIONS ALLOCATION INTO MEMORY */

SECTIONS
{
    rst_sect:  > EPROM
    myvect:    > RAM
    mystack:   > RAM
    .text:     > RAM
    mytrap:    > RAM
}


Processor initialization under C language 


If you are running under a C environment, your initialization routine is typically
boot.asm (from the RTS40.LIB library that comes with the floating-point
compiler). In addition to initializing global variables, boot.asm initializes the DP
register (pointing to the .bss section) and the SP register (pointing to the .stack
section). You need to enable the cache, as shown in Example 1-3, and set up
your interrupts inside your main routine before you enable interrupts. See the
Application Report, Setting Up TMS320 DSP Interrupts in C (SPRA0386), for
more information.


Example 1-3. Enabling the Cache


main() {
    asm(" or 1800h,st");           /* enable cache */
    /* asm(" or 3800h,st"); */     /* enable cache and interrupts */
}



Chapter 2 


Program Control 


Several ’C4x instructions provide program control and facilitate high-speed 
processing. These instructions directly handle: 


- Regular and zero-overhead subroutine calls
- Software stack
- Interrupts
- Delayed branches
- Single- and multiple-instruction loops without overhead
2.1 Subroutines


The ’C4x provides two ways to invoke subroutine calls: regular calls and zero- 
overhead calls. The regular and zero-overhead subroutine calls use the soft- 
ware stack and extended-precision register R11, respectively, to save the re- 
turn address. The following subsections use example programs to explain how 
this works. 


2.1.1 Regular Subroutine Calls


The ’C4x has a 32-bit program counter (PC) and a virtually unlimited software
stack. The CALL and CALLcond instructions increment the stack pointer
and store the address of the next instruction (PC + 1) on the stack. At the
end of the subroutine, RETScond performs a conditional return.


Example 2-1 illustrates the use of a subroutine to determine the dot product
of two vectors. Given two vectors of length N, represented by the arrays a[0],
a[1], ..., a[N-1] and b[0], b[1], ..., b[N-1], the dot product is computed from the
expression


d = a[0] b[0] + a[1] b[1] + ... + a[N-1] b[N-1]


Processing proceeds in the main routine to the point where the dot product is 
to be computed. It is assumed that the arguments of the subroutine have been 
appropriately initialized. At this point, a CALL is made to the subroutine, trans- 
ferring control to that section of the program memory for execution, then re- 
turning to the calling routine via the RETS instruction when execution has com- 
pleted. Note that for this particular example, it would suffice to save the register 
R2. However, a larger number of registers are saved for demonstration pur- 
poses. The saved registers are stored on the system stack, which should be 
large enough to accommodate the maximum anticipated storage require- 
ments. Other methods of saving registers could be used equally well. 
Example 2-1. Regular Subroutine Call (Dot Product)


*
*  TITLE REGULAR SUBROUTINE CALL (DOT PRODUCT)
*
*  MAIN ROUTINE THAT CALLS THE SUBROUTINE 'DOT' TO COMPUTE THE
*  DOT PRODUCT OF TWO VECTORS.
*
        LDI     @blk0,AR0       ;AR0 points to vector a
        LDI     @blk1,AR1       ;AR1 points to vector b
        LDI     N,RC            ;RC contains the number of elements
        CALL    DOT
*
*  SUBROUTINE DOT
*
*  EQUATION: d = a(0) * b(0) + a(1) * b(1) + ... + a(N-1) * b(N-1)
*
*  THE DOT PRODUCT OF a AND b IS PLACED IN REGISTER R0. N MUST
*  BE GREATER THAN OR EQUAL TO 2.
*
*  ARGUMENT ASSIGNMENTS:
*  ARGUMENT | FUNCTION
*  ---------+----------------------
*  AR0      | ADDRESS OF a(0)
*  AR1      | ADDRESS OF b(0)
*  RC       | LENGTH OF VECTORS (N)
*
*  REGISTERS USED AS INPUT: AR0, AR1, RC
*  REGISTER MODIFIED: R0
*  REGISTER CONTAINING RESULT: R0
*
        .global DOT
*
DOT     PUSH    ST              ;Save status register
        PUSH    R2              ;Use the stack to save R2's
        PUSHF   R2              ;bottom 32 and top 32 bits
        PUSH    AR0             ;Save AR0
        PUSH    AR1             ;Save AR1
        PUSH    RC              ;Save RC
        PUSH    RS
        PUSH    RE
*
*  Initialize R0:
*
        MPYF3   *AR0,*AR1,R0    ;a(0) * b(0) -> R0
||      SUBF    R2,R2,R2        ;Initialize R2.
        SUBI    2,RC            ;Set RC = N-2
*
*  DOT PRODUCT (1 <= i < N)
*
        RPTS    RC              ;Set up the repeat single.
        MPYF3   *++AR0(1),*++AR1(1),R0  ;a(i) * b(i) -> R0
||      ADDF3   R0,R2,R2        ;a(i-1)*b(i-1) + R2 -> R2
*
        ADDF3   R0,R2,R0        ;a(N-1)*b(N-1) + R2 -> R0
*
*  RETURN SEQUENCE
*
        POP     RE
        POP     RS
        POP     RC              ;Restore RC
        POP     AR1             ;Restore AR1
        POP     AR0             ;Restore AR0
        POPF    R2              ;Restore top 32 bits of R2
        POP     R2              ;Restore bottom 32 bits of R2
        POP     ST              ;Restore ST
        RETS                    ;Return
*
*  end
*
        .end
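
For readers who want to check the arithmetic of Example 2-1 away from the target, the following is a minimal C sketch of the same computation. It is an illustration only: the function name dot_product and the use of float are assumptions, not part of the original listing. The loop mirrors the parallel MPYF3 || ADDF3 pair, in which the product for element i is formed while the product for element i-1 is accumulated.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical C model of the DOT subroutine (names are illustrative).
 * prod plays the role of R0 and acc the role of R2. */
float dot_product(const float *a, const float *b, size_t n)
{
    assert(n >= 2);               /* the assembly version requires N >= 2 */

    float prod = a[0] * b[0];     /* a(0) * b(0) -> R0 */
    float acc = 0.0f;             /* R2 cleared */

    /* Runs N-1 times, matching RPTS with RC = N-2. */
    for (size_t i = 1; i < n; i++) {
        acc += prod;              /* a(i-1)*b(i-1) + R2 -> R2 */
        prod = a[i] * b[i];       /* a(i) * b(i) -> R0 */
    }
    return prod + acc;            /* a(N-1)*b(N-1) + R2 -> R0 */
}
```

The final addition outside the loop corresponds to the ADDF3 that follows the repeated instruction pair in the listing.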


2.1.2 Zero-Overhead Subroutine Calls 


Two instructions, link and jump (LAJ) and link and jump conditional (LAJcond),
implement zero-overhead subroutine calls on the ’C4x. Unlike CALL and
CALLcond, which put the value of PC + 1 onto the software stack,
LAJ and LAJcond put the value of PC + 4 into extended-precision register R11.
The three instructions following LAJ or LAJcond are executed before control
passes to the subroutine. These three instructions are subject to the same
restrictions as the three instructions following a delayed branch. At the end of
the subroutine, you can use a delayed conditional branch, BcondD, in the register
addressing mode with R11 as the source, to perform a zero-overhead subroutine
return.


For comparison, the same dot product example with a zero-overhead subrou- 
tine call is given in the following example program. 
Example 2-2. Zero-Overhead Subroutine Call (Dot Product)


*
*  TITLE ZERO-OVERHEAD SUBROUTINE CALL (DOT PRODUCT)
*
*  MAIN ROUTINE THAT CALLS THE SUBROUTINE 'DOT' TO COMPUTE THE
*  DOT PRODUCT OF TWO VECTORS.
*
        LAJ     DOT
        LDI     @blk0,AR0       ;AR0 points to vector a
        LDI     @blk1,AR1       ;AR1 points to vector b
        LDI     N,RC            ;RC contains the number of elements
*
*  SUBROUTINE DOT
*
*  EQUATION: d = a(0) * b(0) + a(1) * b(1) + ... + a(N-1) * b(N-1)
*
*  THE DOT PRODUCT OF a AND b IS PLACED IN REGISTER R0. N MUST
*  BE GREATER THAN OR EQUAL TO 2.
*
*  ARGUMENT ASSIGNMENTS:
*  ARGUMENT | FUNCTION
*  ---------+----------------------
*  AR0      | ADDRESS OF a(0)
*  AR1      | ADDRESS OF b(0)
*  RC       | LENGTH OF VECTORS (N)
*
*  REGISTERS USED AS INPUT: AR0, AR1, RC
*  REGISTER MODIFIED: R0
*  REGISTER CONTAINING RESULT: R0
*
        .global DOT
*
DOT     PUSH    ST              ;Save status register
        PUSH    R2              ;Use the stack to save R2's
        PUSHF   R2              ;bottom 32 and top 32 bits
        PUSH    AR0             ;Save AR0
        PUSH    AR1             ;Save AR1
        PUSH    RC              ;Save RC
        PUSH    RS
        PUSH    RE
*
*  Initialize R0:
*
        MPYF3   *AR0,*AR1,R0    ;a(0) * b(0) -> R0
||      SUBF    R2,R2,R2        ;Initialize R2.
        SUBI    2,RC            ;Set RC = N-2
*
*  DOT PRODUCT (1 <= i < N)
*
        RPTS    RC              ;Set up the repeat single
        MPYF3   *++AR0(1),*++AR1(1),R0  ;a(i) * b(i) -> R0
||      ADDF3   R0,R2,R2        ;a(i-1)*b(i-1) + R2 -> R2
*
        ADDF3   R0,R2,R0        ;a(N-1)*b(N-1) + R2 -> R0
*
*  RETURN SEQUENCE
*
        POP     RE
        POP     RS
        POP     RC              ;Restore RC
        POP     AR1             ;Restore AR1
        POP     AR0             ;Restore AR0
        BUD     R11             ;Return
        POPF    R2              ;Restore top 32 bits of R2
        POP     R2              ;Restore bottom 32 bits of R2
        POP     ST              ;Restore ST
*
*  end
*
        .end



2.2 Stacks and Queues 


The ’C4x provides a dedicated stack pointer (SP) for building stacks in
memory. Also, the auxiliary registers can be used to build user stacks and a
variety of more general linear lists. This section discusses the implementation
of the following types of linear lists:


Stack     A linear list for which all insertions and deletions are made at one
          end of the list.


Queue     A linear list for which all insertions are made at one end of the
          list, and all deletions are made at the other end.


Dequeue   A double-ended queue: a linear list for which insertions and
          deletions are made at either end of the list.


2.2.1 System Stacks


A stack in the ’C4x fills from a low-memory address to a high-memory address,
as shown in Figure 2-1. A system stack stores addresses and data during
subroutine calls, traps, and interrupts.


The stack pointer (SP) is a 32-bit register that contains the address of the top 
of the system stack. The SP always points to the last element pushed onto the 
stack. A push performs a preincrement, and a pop performs a postdecrement 
of the SP. Provisions should be made to accommodate your software’s antici- 
pated storage requirements. 
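
The preincrement push and postdecrement pop described above can be sketched in C as follows. This is a simplified model under stated assumptions: memory is a small word array, sp is an index rather than a 32-bit address, and the names sys_stack, stack_push, and stack_pop are hypothetical. No overflow or underflow checking is done, which matches the hardware behavior of leaving that responsibility to the program.

```c
#include <stdint.h>

/* Hypothetical model of the system stack: sp indexes the last element
 * pushed, so the stack grows toward higher indices (higher addresses). */
typedef struct {
    uint32_t mem[16];
    uint32_t sp;
} sys_stack;

/* Push: increment SP first, then store at the new top. */
void stack_push(sys_stack *s, uint32_t value)
{
    s->mem[++s->sp] = value;
}

/* Pop: read the top element, then decrement SP. */
uint32_t stack_pop(sys_stack *s)
{
    return s->mem[s->sp--];
}
```

Because SP always rests on the last element pushed, the word at the initial SP location itself is never written by a push.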


The stack pointer (SP) can be read from as well as written to; multiple stacks
can be created by updating the SP. The SP is not initialized by the hardware
during reset; it is important to initialize its value so that it points
to a predetermined memory location. Example 1-1 shows how
to initialize the SP. You must initialize the stack to a valid free memory space.
Otherwise, use of the stack could corrupt data or program memory.


The program counter is pushed onto the system stack on subroutine calls,
traps, and interrupts. It is popped from the system stack on returns. The PUSH,
POP, PUSHF, and POPF instructions push and pop the system stack. The
stack can be used inside subroutines as temporary storage for registers,
as shown in Example 2-1.




Two instructions, PUSHF and POPF, are for floating-point numbers. These
instructions push and pop floating-point numbers from and to registers R0-R11.
This feature is very useful for saving the extended-precision registers (see
Example 2-1 and Example 2-2). PUSH saves the lower 32 bits of an
extended-precision register, and PUSHF saves the upper 32 bits. To recover the
extended-precision number, execute a POPF followed by a POP. It is important to
perform the integer and floating-point PUSH and POP in the above order, since
POPF forces the last eight bits of the extended-precision registers to zero.
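
The reason the restore order matters can be shown with a small C model. This sketch assumes a 40-bit extended-precision register held in the low 40 bits of a uint64_t, assumes that PUSHF saves bits 39-8 and PUSH saves bits 31-0, and assumes that POP leaves the top eight bits untouched. The function names are hypothetical.

```c
#include <stdint.h>

/* Word saved by PUSH: the lower 32 bits (bits 31-0). */
uint32_t reg_push(uint64_t reg)
{
    return (uint32_t)(reg & 0xFFFFFFFFu);
}

/* Word saved by PUSHF: the upper 32 bits (bits 39-8). */
uint32_t reg_pushf(uint64_t reg)
{
    return (uint32_t)((reg >> 8) & 0xFFFFFFFFu);
}

/* POPF: load bits 39-8 and force the last eight bits to zero. */
uint64_t reg_popf(uint32_t word)
{
    return (uint64_t)word << 8;
}

/* POP: replace bits 31-0 and keep bits 39-32 (assumed behavior). */
uint64_t reg_pop(uint64_t reg, uint32_t word)
{
    return (reg & ~(uint64_t)0xFFFFFFFFu) | word;
}
```

In this model, reg_pop(reg_popf(high), low) reproduces all 40 bits, whereas applying POPF after POP would clear the eight least significant bits that POP had just restored.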


Figure 2-1. System Stack Configuration


    Low Memory
        Bottom of stack
        Top of stack
        (Free)
    High Memory


2.2.2 User Stacks


User stacks can be built to store data from low-to-high memory or from high-to- 
low memory. Two cases for each type of stack are shown. You can build stacks 
by using the preincrement/decrement and postincrement/decrement modes 
of modifying the auxiliary registers (AR). 


You can implement stack growth from high to low memory in two ways: 


Case 1: Store to memory using *--ARn to push data onto the stack, and read
from memory using *ARn++ to pop data off the stack.


Case 2: Store to memory using *ARn-- to push data onto the stack, and read
from memory using *++ARn to pop data off the stack.


Figure 2-2 illustrates these two cases. The only difference is that in case 1,
the AR always points to the top of the stack, and in case 2, the AR always points
to the next free location on the stack.
Figure 2-2. Implementations of High-to-Low Memory Stacks


           Low Memory                 Low Memory
           (Free)                     (Free)
ARn -->    Top of stack               Top of stack
           Bottom of stack            Bottom of stack
           High Memory                High Memory
           Case 1                     Case 2


You can implement stack growth from low to high memory in two ways: 


Case 3: Store to memory using *++ARn to push data onto the stack, and read
from memory using *ARn-- to pop data off the stack.


Case 4: Store to memory using *ARn++ to push data onto the stack, and read
from memory using *--ARn to pop data off the stack.


Figure 2—3 shows these two cases. In case 3, the AR always points to the top 
of the stack. In case 4, the AR always points to the next free location on the 
stack. 


Figure 2-3. Implementations of Low-to-High Memory Stacks


           Low Memory                 Low Memory
           Bottom of stack            Bottom of stack
           Top of stack               Top of stack
ARn -->
           High Memory                High Memory
           Case 3                     Case 4
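
The four user-stack cases map directly onto C pointer idioms, which makes them easy to experiment with on a host machine. The sketch below is illustrative only: an int pointer stands in for ARn, and the function names are hypothetical. In case 3 the pointer is assumed to start on a sacrificial first element so that it never points below the array.

```c
/* Case 1: high-to-low, pointer rests on the top of the stack. */
void push_case1(int **ar, int v) { *--(*ar) = v; }     /* *--ARn */
int  pop_case1(int **ar)         { return *(*ar)++; }  /* *ARn++ */

/* Case 2: high-to-low, pointer rests on the next free location. */
void push_case2(int **ar, int v) { *(*ar)-- = v; }     /* *ARn-- */
int  pop_case2(int **ar)         { return *++(*ar); }  /* *++ARn */

/* Case 3: low-to-high, pointer rests on the top of the stack. */
void push_case3(int **ar, int v) { *++(*ar) = v; }     /* *++ARn */
int  pop_case3(int **ar)         { return *(*ar)--; }  /* *ARn-- */

/* Case 4: low-to-high, pointer rests on the next free location. */
void push_case4(int **ar, int v) { *(*ar)++ = v; }     /* *ARn++ */
int  pop_case4(int **ar)         { return *--(*ar); }  /* *--ARn */
```

In every case the pop uses the addressing mode that undoes the push, so a push followed by a pop returns the pointer to its starting value.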


2.2.3 Queues and Double-Ended Queues 


The implementation of queues and double-ended queues is based upon the
manipulation of the auxiliary registers for user stacks.
For queues, two auxiliary registers are used: one to mark the front of the queue,
from which data is popped, and the other to mark the rear of the queue, to which
data is pushed.


For double-ended queues, two auxiliary registers are also necessary. One 
register marks one end of the double-ended queue, and the other register 
marks the other end. Data can be popped from or pushed onto either end. 
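
A queue built from two markers can be sketched in C as shown below. This is a minimal linear model, not a definitive implementation: the two indices stand in for the two auxiliary registers, the capacity of eight is arbitrary, and the names queue, queue_put, and queue_get are hypothetical. Space is not reused once the rear marker reaches the end of the buffer.

```c
#include <stddef.h>

#define QUEUE_CAPACITY 8

/* front marks where data is removed, rear marks where data is added. */
typedef struct {
    int    data[QUEUE_CAPACITY];
    size_t front;
    size_t rear;
} queue;

/* Add a value at the rear; returns 0 on success, -1 if no room is left. */
int queue_put(queue *q, int value)
{
    if (q->rear == QUEUE_CAPACITY)
        return -1;
    q->data[q->rear++] = value;
    return 0;
}

/* Remove a value from the front; returns 0 on success, -1 if empty. */
int queue_get(queue *q, int *out)
{
    if (q->front == q->rear)
        return -1;
    *out = q->data[q->front++];
    return 0;
}
```

A double-ended queue follows the same pattern, with both markers allowed to move in either direction.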




2.3 Interrupt Examples 


When using interrupts, you must consider several issues. This section offers 
examples of several interrupt-related topics: 


- Interrupt Service Routines
- Context Switching
- Interrupt-Vector Table (IVTP)
- Interrupt Priorities


2.3.1 Correct Interrupt Programming


For interrupts to work properly, you need to execute the following sequence of
steps, as shown in Example 1-1:


1) Place the interrupt-vector table on a 512-word boundary.
2) Initialize the IVTP register.
3) Create a software stack.
4) Enable the specific interrupt.
5) Enable global interrupts.
6) Generate the interrupt signal.


2.3.2 Software Polling of Interrupts 


The interrupt flag register can be polled, and action can be taken, depending 
on whether an interrupt has occurred. This is true even when maskable inter- 
rupts are disabled. This can be useful when an interrupt-driven interface is not 
implemented. Example 2-3 shows the case in which a subroutine is called 
when external interrupt 1 has not occurred. 


Example 2-3. Use of Interrupts for Software Polling 


* TITLE INTERRUPT POLLING 


TSTB 40H, IIF ;Test if interrupt 1 has occurred 
CALLZ SUBROUTINE ;If not, call subroutine 


When interrupt processing begins, the program counter is pushed onto the
stack, and the interrupt vector is loaded into the program counter. GIE is
cleared to 0, which disables interrupts, and the program continues from the
address loaded into the program counter. Because all maskable interrupts are
disabled, interrupt processing can proceed without further interruption unless
the interrupt service routine re-enables interrupts or the NMI occurs.
2.3.3 Using One Interrupt for Two Services


The IVTP can be changed to point to alternate interrupt-vector tables. This re- 
locatable feature of the table allows you to use a single interrupt signal for more 
than one service. 


In Example 2-4, the IVTP is reset in the external INT0 interrupt service
routines EINT0A and EINT0B. After the value of the IVTP is changed, the CPU
goes to a different interrupt service routine when the same interrupt signal
reoccurs.


Example 2-4. Use of One Interrupt Signal for Two Different Services


*  TITLE USE OF ONE INTERRUPT SIGNAL FOR TWO DIFFERENT SERVICES
*
*  IN THIS EXAMPLE, THE ADDRESSES OF EINT0A AND EINT0B ARE IN
*  MEMORY LOCATIONS 03H AND 1003H, RESPECTIVELY. ASSUME THE IVTP
*  HAS NOT BEEN CHANGED AFTER DEVICE RESET AND THE EXTERNAL
*  INTERRUPT IIOF0 IS ENABLED. WHEN THE FIRST IIOF0 INTERRUPT
*  SIGNAL COMES IN, THE EINT0A ROUTINE WILL BE EXECUTED AND THEN
*  IF THE NEXT IIOF0 INTERRUPT SIGNAL OCCURS, THE EINT0B ROUTINE
*  WILL BE EXECUTED, AND SO ON. THE EINT0A AND EINT0B ROUTINES
*  WILL TAKE TURNS TO BE EXECUTED WHEN THE IIOF0 INTERRUPT
*  SIGNAL OCCURS.
*
*  External IIOF0 interrupt service routine A
*
        .global EINT0A
EINT0A:
        LDI     1000H,R0        ;Change IVTP to point to 1000H
        LDPE    R0,IVTP
*
        RETI                    ;Return and enable interrupts
*
*  External IIOF0 interrupt service routine B
*
        .global EINT0B
EINT0B:
        LDI     0,R0            ;Change IVTP to point to 0
        LDPE    R0,IVTP
*
        RETI                    ;Return and enable interrupts




2.3.4 Nesting Interrupts 


In Example 2-5, the interrupt service routine for INT2 temporarily modifies the
interrupt enable register (IIE) and interrupt flag register (IIF) to permit interrupt
processing when an interrupt to INT0 or NMI (but no other interrupt) occurs.
When the routine finishes processing, the IIE register is restored to its original
state. Notice that the RETIcond instruction not only pops the next program
counter address from the stack, but also restores the GIE and CF bits from the
PGIE and PCF bits. This re-enables all interrupts that were enabled before the
INT2 interrupt was serviced.


Example 2-5. Interrupt Service Routine 


*  TITLE INTERRUPT SERVICE ROUTINE
*
        .global ISR2
*
ENABLE  .set    2000h
MASK    .set    9h
*
*  INTERRUPT PROCESSING FOR EXTERNAL INTERRUPT INT2-
*
ISR2:
        PUSH    ST              ;Save status register
        PUSH    DP              ;Save data page pointer
        PUSH    IIE             ;Save interrupt enable register
        PUSH    IIF             ;Save interrupt flag register
        PUSH    R0              ;Save lower 32 bits and
        PUSHF   R0              ;upper 32 bits of R0
        PUSH    R1              ;Save lower 32 bits and
        PUSHF   R1              ;upper 32 bits of R1
        LDI     0,IIE           ;Mask all internal interrupts
        LDI     MASK,R0
        MH0     R0,IIF          ;Enable INT2
        OR      ENABLE,ST       ;Enable all interrupts
*
*  MAIN PROCESSING SECTION FOR ISR2
*
        XOR     ENABLE,ST       ;Disable all interrupts
        POPF    R1              ;Restore upper 32 bits and
        POP     R1              ;lower 32 bits of R1
        POPF    R0              ;Restore upper 32 bits and
        POP     R0              ;lower 32 bits of R0
        POP     IIF             ;Restore interrupt flag register
        POP     IIE             ;Restore interrupt enable register
        POP     DP              ;Restore data page register
        POP     ST              ;Restore status register
*
        RETI                    ;Return and enable interrupts
2.4 Context Switching in Interrupts and Subroutines


Context switching is commonly required when a subroutine call or interrupt is
processed. It can be extensive or simple, depending on system requirements.
For the ’C4x, the program counter is automatically pushed onto the stack.
Important information in other ’C4x registers, such as the status, auxiliary, or
extended-precision registers, must be saved on the stack with PUSH/PUSHF and
recovered later with POP/POPF instructions.


You need to preserve only the registers that are modified inside of your subrou- 
tine or interrupt/trap service routine and that could potentially affect the pre- 
vious context environment. 


Note:


The status register should be saved first and restored last to preserve the 
processor status without any further change caused by other context-switch- 
ing instructions. 


If the previous context environment was in C, then your program must perform
one of two tasks:


- If the program is in a subroutine, it must preserve the dedicated C
  registers:

      Save as integers                 Save as floating-point
      R4     R5                        R6     R7
      AR4    AR5
      AR6    AR7
      FP     DP (small model only)
      SP     R8 (’C4x only)


- If the program is in an interrupt service routine, it must preserve all of
  the ’C4x registers, as Example 2-6 shows.


If the previous context environment was in assembly language, you need to 
determine which registers you must save based on the operations of your as- 
sembly-language code. 




Example 2-6. Context Save and Context Restore 


        .global ISR1
*
*  TOTAL CONTEXT SAVE ON INTERRUPT.
*
ISR1:   PUSH    ST              ;Save status register
*
*  SAVE THE EXTENDED PRECISION REGISTERS
*
        PUSH    R0              ;Save the lower 32 bits of R0
        PUSHF   R0              ;and the upper 32 bits
        PUSH    R1              ;Save the lower 32 bits of R1
        PUSHF   R1              ;and the upper 32 bits
        PUSH    R2              ;Save the lower 32 bits of R2
        PUSHF   R2              ;and the upper 32 bits
        PUSH    R3              ;Save the lower 32 bits of R3
        PUSHF   R3              ;and the upper 32 bits
        PUSH    R4              ;Save the lower 32 bits of R4
        PUSHF   R4              ;and the upper 32 bits
        PUSH    R5              ;Save the lower 32 bits of R5
        PUSHF   R5              ;and the upper 32 bits
        PUSH    R6              ;Save the lower 32 bits of R6
        PUSHF   R6              ;and the upper 32 bits
        PUSH    R7              ;Save the lower 32 bits of R7
        PUSHF   R7              ;and the upper 32 bits
        PUSH    R8              ;Save the lower 32 bits of R8
        PUSHF   R8              ;and the upper 32 bits
        PUSH    R9              ;Save the lower 32 bits of R9
        PUSHF   R9              ;and the upper 32 bits
        PUSH    R10             ;Save the lower 32 bits of R10
        PUSHF   R10             ;and the upper 32 bits
        PUSH    R11             ;Save the lower 32 bits of R11
        PUSHF   R11             ;and the upper 32 bits
*
*  SAVE THE AUXILIARY REGISTERS
*
        PUSH    AR0             ;Save AR0
        PUSH    AR1             ;Save AR1
        PUSH    AR2             ;Save AR2
        PUSH    AR3             ;Save AR3
        PUSH    AR4             ;Save AR4
        PUSH    AR5             ;Save AR5
        PUSH    AR6             ;Save AR6
        PUSH    AR7             ;Save AR7
*
*  SAVE THE REST OF THE REGISTERS FROM THE REGISTER FILE
*
        PUSH    DP              ;Save data page pointer
        PUSH    IR0             ;Save index register IR0
        PUSH    IR1             ;Save index register IR1
        PUSH    BK              ;Save block-size register
        PUSH    IIE             ;Save interrupt enable register
        PUSH    IIF             ;Save interrupt flag register
        PUSH    DIE             ;Save DMA interrupt enable register
        PUSH    RS              ;Save repeat start address
        PUSH    RE              ;Save repeat end address
        PUSH    RC              ;Save repeat counter
*
*  SAVE IS COMPLETE
*
*
*  YOUR INTERRUPT SERVICE ROUTINE CODE GOES HERE
*
        .global RESTR
*
*  CONTEXT RESTORE AT THE END OF A SUBROUTINE CALL OR INTERRUPT
*
RESTR:
*
*  RESTORE THE REST OF THE REGISTERS FROM THE REGISTER FILE
*
        POP     RC              ;Restore repeat counter
        POP     RE              ;Restore repeat end address
        POP     RS              ;Restore repeat start address
        POP     DIE             ;Restore DMA interrupt enable register
        POP     IIF             ;Restore interrupt flag register
        POP     IIE             ;Restore interrupt enable register
        POP     BK              ;Restore block-size register
        POP     IR1             ;Restore index register IR1
        POP     IR0             ;Restore index register IR0
        POP     DP              ;Restore data page pointer
*
*  RESTORE THE AUXILIARY REGISTERS
*
        POP     AR7             ;Restore AR7
        POP     AR6             ;Restore AR6
        POP     AR5             ;Restore AR5
        POP     AR4             ;Restore AR4
        POP     AR3             ;Restore AR3
        POP     AR2             ;Restore AR2
        POP     AR1             ;Restore AR1
        POP     AR0             ;Restore AR0
*
*  RESTORE THE EXTENDED PRECISION REGISTERS
*
        POPF    R11             ;Restore the upper 32 bits and
        POP     R11             ;the lower 32 bits of R11
        POPF    R10             ;Restore the upper 32 bits and
        POP     R10             ;the lower 32 bits of R10
        POPF    R9              ;Restore the upper 32 bits and
        POP     R9              ;the lower 32 bits of R9
        POPF    R8              ;Restore the upper 32 bits and
        POP     R8              ;the lower 32 bits of R8
        POPF    R7              ;Restore the upper 32 bits and
        POP     R7              ;the lower 32 bits of R7
        POPF    R6              ;Restore the upper 32 bits and
        POP     R6              ;the lower 32 bits of R6
        POPF    R5              ;Restore the upper 32 bits and
        POP     R5              ;the lower 32 bits of R5
        POPF    R4              ;Restore the upper 32 bits and
        POP     R4              ;the lower 32 bits of R4
        POPF    R3              ;Restore the upper 32 bits and
        POP     R3              ;the lower 32 bits of R3
        POPF    R2              ;Restore the upper 32 bits and
        POP     R2              ;the lower 32 bits of R2
        POPF    R1              ;Restore the upper 32 bits and
        POP     R1              ;the lower 32 bits of R1
        POPF    R0              ;Restore the upper 32 bits and
        POP     R0              ;the lower 32 bits of R0
        POP     ST              ;Restore status register
*
*  RESTORE IS COMPLETE
*
        RETI
2.5 Repeat Modes


2.5.1 Block Repeat


The RPTB, RPTBD, and RPTS instructions support looping without overhead. 
Loop execution parameters are specified by three registers, as can be seen 
in the following examples: 


- RS (Repeat start address)
- RE (Repeat end address)
- RC (Repeat counter)


In principle, it is possible to nest repeat blocks. However, there is only one set 
of control registers: RS, RE, and RC. It is, therefore, necessary to save these 
registers before entering an inside loop and to restore these registers after 
completing the inside loop. It takes four cycles of overhead to save and restore 
these registers. Hence, sometimes it may be more economical to implement 
a nested loop by the more traditional method of using a register as a counter 
and then using a delayed branch, rather than by using the nested repeat block 
approach. Often, implementing the outer loop as a counter and the inner loop
as an RPTB/RPTBD instruction produces the fastest execution.


Example 2—7 shows the use of the block repeat to find the maximum or the 
minimum value of 147 numbers. The elements of the array are either all 
positive or all negative numbers. Because the loop cannot be predetermined, 
the RPTBD instruction is not suitable here. 


Example 2-7. Use of Block Repeat to Find a Maximum or a Minimum 


*  TITLE USE OF BLOCK REPEAT TO FIND A MAXIMUM OR A MINIMUM
*
*  THIS ROUTINE FINDS MAXIMUM OR MINIMUM OF N=147 NUMBERS
*
        LDI     146,RC          ;Initialize repeat counter to 147-1
        LDI     @ADDR,AR0       ;AR0 points to beginning of array
        LDF     *AR0++(1),R0    ;Initialize MAX or MIN to first value
        BLT     LOOP2           ;If negative array, find minimum
*
LOOP1   RPTB    MAX
        CMPF    *AR0,R0         ;Compare number to the maximum
MAX     LDFLT   *AR0++(1),R0    ;If greater, this is a new maximum
        B       NEXT
LOOP2   RPTB    MIN
        CMPF    *AR0++(1),R0    ;Compare number to the minimum
MIN     LDFGT   *-AR0(1),R0     ;If smaller, this is new minimum
2.5.2 Delayed Block Repeat


Example 2-8 shows an application of the delayed block-repeat construct. In 
this example, an array of 64 elements is flipped over by exchanging the ele- 
ments that are equidistant from the end of the array. In other words, if the origi- 
nal array is: 


a(1), a(2),..., a(31), a(32),..., a(63), a(64); 
then the final array after the rearrangement is: 
a(64), a(63),..., a(32), a(31),..., a(2), a(1). 


Because the exchange operation is performed on two elements at the same
time, it requires 32 operations. The repeat counter (RC) is initialized to 31. In
general, if RC contains the number N, the loop is executed N + 1 times. In the
example, the loop begins at the fourth instruction following the RPTBD
instruction (at the START label) and ends at the EXCH label. RC should not be
initialized in the three instructions following the RPTBD.


Example 2-8. Loop Using Delayed Block Repeat


*  TITLE LOOP USING DELAYED BLOCK REPEAT
*
*  THIS CODE SEGMENT EXCHANGES THE VALUES OF ARRAY
*  ELEMENTS THAT ARE SYMMETRIC AROUND THE MIDDLE OF THE
*  ARRAY.
*
        LDI     31,RC           ;Initialize repeat counter
*
        RPTBD   EXCH            ;Repeat RC + 1 times between
                                ;START and EXCH
        LDI     @ADDR,AR0       ;AR0 points to the beginning of the array
        LDI     AR0,AR1
        ADDI    63,AR1          ;AR1 points to the end of the array
*
*  The loop starts here
*
START   LDI     *AR0,R0         ;Load one memory element in R0,
||      LDI     *AR1,R1         ;and the other in R1
EXCH    STI     R1,*AR0++(1)    ;Then, exchange their locations
||      STI     R0,*AR1--(1)
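
The exchange performed by Example 2-8 can be modeled in C to confirm the loop count. The sketch below is an illustration under assumptions: the function name flip_array and the int element type are hypothetical, and the two pointers stand in for AR0 and AR1. The loop body runs n/2 times, which corresponds to loading RC with n/2 - 1 (31 for a 64-element array).

```c
#include <stddef.h>

/* Reverse an array in place by exchanging elements that are
 * equidistant from the two ends. */
void flip_array(int *a, size_t n)
{
    if (n < 2)
        return;                   /* nothing to exchange */

    int *lo = a;                  /* AR0: beginning of the array */
    int *hi = a + (n - 1);        /* AR1: end of the array */

    for (size_t count = n / 2; count > 0; count--) {
        int r0 = *lo;             /* load one element ... */
        int r1 = *hi;             /* ... and the other */
        *lo++ = r1;               /* then exchange their locations */
        *hi-- = r0;
    }
}
```

Both loads complete before either store, as in the parallel LDI || LDI and STI || STI pairs of the listing.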
2.5.3 Single-Instruction Repeat


Example 2-9 shows an application of the repeat-single construct. In this ex- 
ample, the sum of the products of two arrays is computed. The arrays are not 
necessarily different. If the arrays are a(i) and b(i), and if each is of length
N = 512, register R2 contains the following quantity:


a(1) b(1) + a(2) b(2) +...+ a(N) b(N). 


The value of the repeat counter (RC) is specified to be 511 in the instruction. 


Example 2-9. Loop Using Single Repeat


*  TITLE LOOP USING SINGLE REPEAT
*
        LDI     @ADDR1,AR0      ;AR0 points to array a(i)
        LDI     @ADDR2,AR1      ;AR1 points to array b(i)
*
        LDF     0.0,R2          ;Initialize R2
        MPYF3   *AR0++(1),*AR1++(1),R1  ;Compute first product
        RPTS    511             ;Repeat 512 times
        MPYF3   *AR0++(1),*AR1++(1),R1  ;Compute next product and
||      ADDF3   R1,R2,R2        ;accumulate the previous
        ADDF    R1,R2           ;One final addition




2.6 Computed GOTOs to Select Subroutines at Runtime 


Occasionally, it is convenient to select the subroutine to be executed during
runtime rather than during assembly. The ’C4x’s computed GOTO supports this
selection. You can implement the computed GOTO by using the CALLcond
instruction in the register addressing mode. This instruction uses the contents
of the register as the address of the call. Example 2-10 shows the case of a task
controller.


Example 2-10. Computed GOTO 


*  TITLE COMPUTED GOTO
*
*  TASK CONTROLLER
*
*  THIS MAIN ROUTINE CONTROLS THE ORDER OF TASK EXECUTION
*  (6 TASKS IN THE PRESENT EXAMPLE). TASK0 THROUGH TASK5 ARE
*  THE NAMES OF SUBROUTINES TO BE CALLED. THEY ARE EXECUTED
*  IN ORDER, TASK0, TASK1, ... TASK5. WHEN AN INTERRUPT
*  OCCURS, THE INTERRUPT SERVICE ROUTINE IS EXECUTED, AND THE
*  PROCESSOR CONTINUES WITH THE INSTRUCTION FOLLOWING THE
*  IDLE INSTRUCTION. THIS ROUTINE SELECTS THE APPROPRIATE
*  TASK FOR THE CURRENT CYCLE, CALLS THE TASK AS A SUBROUTINE,
*  AND BRANCHES BACK TO THE IDLE INSTRUCTION TO WAIT FOR THE
*  NEXT SAMPLE INTERRUPT WHEN THE SCHEDULED TASK HAS COMPLETED
*  EXECUTION. R0 HOLDS THE OFFSET FROM THE BASE ADDRESS OF THE
*  TASK TO BE EXECUTED. BIT 15 (SET COND BIT) OF STATUS REGISTER
*  (ST) SHOULD BE SET TO 1.
*
        LDI     5,IR0           ;Initialize IR0
        LDI     @ADDR,AR1       ;AR1 holds the base address
                                ;of the table
WAIT    IDLE                    ;Wait for the next interrupt
        ADDI    *+AR1(IR0),R1   ;Add base address to the
                                ;table entry number
        SUBI    1,IR0           ;Decrement IR0
        LDILT   5,IR0           ;If IR0<0, reinitialize it to 5
        CALLU   R1              ;Execute appropriate task
        BR      WAIT
*
TSKSEQ  .word   TASK5           ;Address of TASK5
        .word   TASK4           ;Address of TASK4
        .word   TASK3           ;Address of TASK3
        .word   TASK2           ;Address of TASK2
        .word   TASK1           ;Address of TASK1
        .word   TASK0           ;Address of TASK0
ADDR    .word   TSKSEQ
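
The same run-time selection is commonly expressed in C with a table of function pointers. The sketch below is a hypothetical analogue of the task controller, not a translation of it: dispatch_next, task_a, task_b, and last_task are illustrative names, and the two sample tasks only record which one ran. As in the listing, the index counts down through the table and is reloaded when it drops below zero.

```c
typedef void (*task_fn)(void);

/* Records which sample task ran most recently (for illustration only). */
int last_task = -1;

void task_a(void) { last_task = 0; }
void task_b(void) { last_task = 1; }

/* Call the task selected by *index, after stepping the index down and
 * wrapping it back to count - 1, as IR0 is decremented and reloaded. */
void dispatch_next(const task_fn *table, int count, int *index)
{
    task_fn task = table[*index];   /* address chosen at run time */

    if (--*index < 0)
        *index = count - 1;

    task();                         /* corresponds to CALLU R1 */
}
```

With a table of { task_b, task_a } and a starting index of 1, successive calls run task_a, then task_b, then task_a again.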
Chapter 3


Logical and Arithmetic Operations


The ’C4x instruction set supports both integer and floating-point arithmetic and 
logical operations. The basic functions of such instructions can be combined 
to form more complex operations. This chapter contains the following opera- 
tions examples: 


- Bit manipulation
- Block moves
- Byte and half-word manipulation
- Bit-reversed addressing
- Integer and floating-point division
- Square root
- Extended-precision arithmetic
- Floating-point format conversion between IEEE and ’C4x formats
3.1 Bit Manipulation


Instructions for logical operations, such as AND, OR, NOT, ANDN, and XOR, 
can be used together with shift instructions for bit manipulation. A special 
instruction, TSTB, tests bits. TSTB does the same operation as AND, but the 
result of the TSTB is used only to set the condition flags and is not written any- 
where. Example 3-1 and Example 3—2 demonstrate the use of several in- 
structions for bit manipulation and testing. 


Example 3-1. Use of TSTB for Software-Controlled Interrupt


*  TITLE USE OF TSTB FOR SOFTWARE-CONTROLLED INTERRUPT
*
*  IN THIS EXAMPLE, ALL INTERRUPTS HAVE BEEN DISABLED BY
*  RESETTING THE GIE BIT OF THE STATUS REGISTER. WHEN AN
*  INTERRUPT ARRIVES, IT IS STORED IN THE IIF REGISTER. THE
*  PRESENT EXAMPLE ACTIVATES THE INTERRUPT SERVICE ROUTINE INTR
*  WHEN IT DETECTS THAT INT2- HAS OCCURRED.
*
        TSTB    4,IIF           ;Check if bit 2 of IIF is set,
        CALLNZ  INTR            ;and, if so, call subroutine INTR


Example 3-2. Copy a Bit from One Location to Another 


* TITLE COPY A BIT FROM ONE LOCATION TO ANOTHER
*
* BIT I OF R1 NEEDS TO BE COPIED TO BIT J OF R2. AR0 POINTS TO A LOCATION
* HOLDING I, AND IT IS ASSUMED THAT THE NEXT MEMORY LOCATION HOLDS THE VALUE J.
*
        LDI     1,R0
        LSH     *AR0,R0         ;Shift 1 to align it with bit I
        TSTB    R1,R0           ;Test the I-th bit of R1
        BZD     CONT            ;If bit = 0, branch delayed
        LDI     1,R0
        LSH     *+AR0(1),R0     ;Align 1 with J-th location
        ANDN    R0,R2           ;If bit = 0, reset J-th bit of R2
        OR      R0,R2           ;If bit = 1, set J-th bit of R2
CONT
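
The bit-copy logic can be checked on a host machine before it is committed to assembly. The following Python sketch is only a model of the intended result, not of the ’C4x pipeline; the function name `copy_bit` and its arguments are illustrative. In the assembly version, the delayed branch always executes ANDN (clearing bit J), and OR is executed only when bit I is 1, which yields the same result as the if/else used here.

```python
def copy_bit(r1, r2, i, j):
    """Return r2 with bit j replaced by bit i of r1 (32-bit words)."""
    mask_j = 1 << j
    if r1 & (1 << i):      # like TSTB: test bit i without storing the AND result
        r2 |= mask_j       # like OR: set bit j
    else:
        r2 &= ~mask_j      # like ANDN: clear bit j
    return r2 & 0xFFFFFFFF


# Bit 2 of r1 is 1, so bit 5 of r2 becomes 1.
print(hex(copy_bit(0b0100, 0x00000000, 2, 5)))
# Bit 2 of r1 is 0, so bit 5 of r2 is cleared.
print(hex(copy_bit(0b0000, 0xFFFFFFFF, 2, 5)))
```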
3.2 Block Moves


Because the ’C4x directly addresses a large amount of memory, blocks of data 
or program code can be stored off-chip in slow memories and then loaded 
on-chip for faster execution. Data can also be moved from on-chip memory to 
off-chip memory for storage or for multiprocessor data transfers. 


The DMA can transfer data efficiently in parallel with CPU operations. Alterna- 
tively, you can use the load and store instructions in a repeat mode to perform 
data transfers under program control. Example 3-3 shows how to transfer a 
block of 512 floating-point numbers from external memory to block 1 of on-chip 
RAM. 


Example 3-3. Block Move Under Program Control


* TITLE BLOCK MOVE UNDER PROGRAM CONTROL
*
extern  .word   01000H
block1  .word   02FFC00H
        LDI     @extern,AR0     ;Source address
        LDI     @block1,AR1     ;Destination address
        LDF     *AR0++,R0       ;Load the first number
        RPTS    510             ;Repeat following instruction 511 times
        LDF     *AR0++,R0       ;Load the next number, and...
||      STF     R0,*AR1++       ;store the previous one
        STF     R0,*AR1         ;Store the last number
3.3 Byte and Half-Word Manipulation


A set of instructions for byte and half-word access, such as LB(3,2,1,0),
LBU(3,2,1,0), LH(1,0), LHU(1,0), LWL(0,1,2,3), LWR(0,1,2,3), MB(3,2,1,0),
and MH(1,0), is available on the ’C4x. In applications such as image
processing, it is often important to be able to manipulate packed data. For
example, the pixels in color images are often represented by four 8-bit
unsigned quantities (red, green, blue, and alpha) that are packed into a single
32-bit word. The byte and half-word instructions make it easy to manipulate
this packed data.


Example 3-4 shows the packing of data from a half-word FIFO to 32-bit data
memory, and Example 3-5 shows the unpacking of a 32-bit data array into a
4-byte-wide data array (assuming the 32-bit data array contains four 8-bit
unsigned numbers).


Example 3-4. Use of Packing Data From Half-Word FIFO to 32-Bit Data Memory


* TITLE USE OF PACKING DATA FROM HALF-WORD FIFO
*       TO 32-BIT DATA MEMORY
*
* IN THIS EXAMPLE, EVERY TWO 16-BIT INPUT DATA VALUES ARE
* PACKED INTO ONE 32-BIT DATA MEMORY WORD. THE LOOP SIZE
* USED HERE IS ARRAY SIZE, NOT THE INPUT DATA LENGTH.
*
        LDI     size-1,RC       ;Load array size
        RPTBD   PACK
        LDI     @fifo_adr,AR1   ;Load fifo address
        LDI     @array,AR2      ;Load data array address
        NOP
* >>>>>>>>>>>>>>>>              ;Loop starts here
        LWL0    *AR1,R9         ;Pack 16 LSBs
        LWL1    *AR1,R9         ;Pack 16 MSBs
PACK    STI     R9,*AR2++(1)    ;Store the data
Example 3-5. Use of Unpacking 32-Bit Data Into Four-Byte-Wide Data Array


* TITLE USE OF UNPACKING 32-BIT DATA INTO FOUR BYTE-WIDE
*       DATA ARRAY
*
* THIS EXAMPLE ASSUMES THAT THE 32-BIT DATA CONTAINS FOUR 8-BIT
* UNSIGNED DATA.
*
        LDI     size-1,RC       ;Load array size
        LDI     @input_adr,AR0  ;Load input address
        LDI     @array1,AR1     ;Load output data array 1 address
        RPTBD   UNPACK
        LDI     @array2,AR2     ;Load output data array 2 address
        LDI     @array3,AR3     ;Load output data array 3 address
        LDI     @array4,AR4     ;Load output data array 4 address
* >>>>>>>>>>>>>>>>              ;Loop starts here
        LBU0    *AR0,R8         ;Unpack first byte
        STI     R8,*AR1++(1)
        LBU1    *AR0,R8         ;Unpack second byte
        STI     R8,*AR2++(1)
        LBU2    *AR0,R8         ;Unpack third byte
        STI     R8,*AR3++(1)
        LBU3    *AR0++(1),R8    ;Unpack fourth byte
UNPACK  STI     R8,*AR4++(1)
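
The data layout that the packing and unpacking examples produce can be modeled on a host machine. The following Python sketch is an illustration under the assumption that byte 0 and half-word 0 are the least significant fields of the 32-bit word; the function names are hypothetical and do not correspond to ’C4x instructions.

```python
def pack_halfwords(lo, hi):
    """Pack two 16-bit values into one 32-bit word (lo in the 16 LSBs)."""
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)


def unpack_bytes(word):
    """Split a 32-bit word into four unsigned bytes, least significant first."""
    return [(word >> (8 * k)) & 0xFF for k in range(4)]


print(hex(pack_halfwords(0x1234, 0xABCD)))
print([hex(b) for b in unpack_bytes(0xAABBCCDD)])
```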
3.4 Bit-Reversed Addressing


The ’C4x can implement fast Fourier transforms (FFT) with bit-reversed
addressing. If the data to be transformed is in the correct order, the final
result of the FFT is scrambled in bit-reversed order. To recover the
frequency-domain data in the correct order, certain memory locations must be
swapped. The bit-reversed addressing mode makes swapping unnecessary. The next
time data is accessed, the access is bit-reversed rather than sequential. On
the ’C4x, bit-reversed addressing can be implemented through both the CPU
and the DMA.


For correct CPU or DMA bit-reversed operation, the base address of
bit-reversed addressing must be located on a boundary of the size of the table.
To clarify this point, assume an FFT of size N = 2^n. When real and imaginary
data are stored in separate arrays, the n LSBs of the base address must be zero
(0), and IR0 must be initialized to 2^(n-1) (half of the FFT size). When real
and imaginary data are stored in consecutive memory locations (Re-Im-Re-Im),
the n+1 LSBs of the base address must be zero (0), and IR0 must be equal
to 2^n = N (FFT size).


3.4.1 CPU Bit-Reversed Addressing


One auxiliary register (AR0, in this case) points to the physical location of a
data value. When you add IR0 to the auxiliary register by using bit-reversed
addressing, addresses are generated in a bit-reversed fashion (reverse carry
propagation). The largest index (IR0, in this case) for bit reversing is
00FF FFFFh.


Example 3-6 illustrates how to move a 512-point complex FFT from the place
of computation (pointed at by AR0) to a location pointed at by AR1. Reads are
executed in a bit-reversed fashion and writes in a linear fashion. In this
example, the real and imaginary parts XR(i) and XI(i) of the data are not
stored in separate arrays; they are interleaved as XR(0), XI(0), XR(1), XI(1),
..., XR(N-1), XI(N-1). Because of this arrangement, the length of the array is
2N instead of N, and IR0 is set to 512 instead of 256.
Example 3-6. CPU Bit-Reversed Addressing


*
* TITLE BIT-REVERSED ADDRESSING
*
* THIS EXAMPLE MOVES THE RESULT OF THE 512-POINT FFT COMPUTATION, POINTED AT BY
* AR0, TO A LOCATION POINTED AT BY AR1. REAL AND IMAGINARY POINTS ARE ALTERNATING.
*
        LDI     511,RC              ;Repeat 511+1 times
        RPTBD   LOOP
        LDI     512,IR0             ;Load FFT size
        LDI     2,IR1
        LDF     *+AR0(1),R1         ;Load first imaginary point
*
        LDF     *AR0++(IR0)B,R0     ;Load real value (and point to next
||      STF     R1,*+AR1(1)         ;location) and store the imaginary value
LOOP    LDF     *+AR0(1),R1         ;Load next imaginary point and store
||      STF     R0,*AR1++(IR1)      ;previous real value
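
The address sequence produced by adding IR0 with reverse carry propagation can be reproduced on a host machine. The following Python sketch is a simplified model that assumes IR0 is a power of two and that only the address bits at and below the IR0 bit take part in the addition; the function names are illustrative and not part of any TI tool.

```python
def reverse_bits(x, nbits):
    """Reverse the order of the nbits least significant bits of x."""
    r = 0
    for _ in range(nbits):
        r = (r << 1) | (x & 1)
        x >>= 1
    return r


def reverse_carry_add(addr, index, nbits):
    """Add index to the low nbits of addr with the carry moving toward the LSB."""
    mask = (1 << nbits) - 1
    low = reverse_bits(addr & mask, nbits) + reverse_bits(index & mask, nbits)
    return (addr & ~mask) | reverse_bits(low & mask, nbits)


def bit_reversed_sequence(base, ir0, count):
    """Addresses visited by repeated *ARn++(IR0)B style updates (modeled)."""
    nbits = ir0.bit_length()        # IR0 = 2**(nbits - 1)
    seq, addr = [], base
    for _ in range(count):
        seq.append(addr)
        addr = reverse_carry_add(addr, ir0, nbits)
    return seq


# 8-point FFT, separate real/imaginary arrays: IR0 = N/2 = 4
print(bit_reversed_sequence(0, 4, 8))
# 8-point FFT, interleaved data: IR0 = N = 8 (offsets of the real parts)
print(bit_reversed_sequence(0, 8, 8))
```

With interleaved data, the generated offsets are twice the bit-reversed sample index, which is why the imaginary part is read at offset +1 from each generated address.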

3.4.2 DMA Bit-Reversed Addressing


In DMA bit-reversed addressing, two bits in the DMA control register enable
bit-reversed addressing on DMA reads (READ BIT REV) and DMA writes
(WRITE BIT REV). The source address index register and destination address
index register define the size of the bit-reversed addressing. Their function is
similar to the CPU index register IR0 described in the previous subsection.
Two DMA block transfers are required when the DMA is used for bit-reversed
transfer of complex numbers: one to transfer the real parts and one to transfer
the imaginary parts.


Figure 3-1 illustrates the DMA settings required for a DMA operation
equivalent to Example 3-6. Unified-autoinitialization mode and bit-reversed
read are used. For more detailed information about DMA operation, refer to
The DMA Coprocessor in the TMS320C4x User’s Guide.
Figure 3-1. DMA Bit-Reversed Addressing


                      DMA registers     Autoinitialization values (at label)
Control Register      00C0 1009h        00C0 1005h
src Address           AR0               AR0+1
src Index             IR0               IR0
Counter               512
dst Address
dst Index
Link Pointer          label
3.5 Integer and Floating-Point Division


You can use the single-cycle instruction, RCPF, to generate an estimate of the
reciprocal of a floating-point number. This estimate has the correct exponent,
and the mantissa is accurate to the eighth binary place (the error of the
mantissa is < 2^-8). Often, this is a satisfactory estimate of the reciprocal
of a floating-point number. In other cases, this estimate can be used as a seed
for an algorithm that computes the reciprocal to even greater accuracy. The
Newton-Raphson algorithm described later is one such case.


Although the ’C4x provides no special instruction for integer division, the
instruction set can perform an efficient division routine. Additionally, the
FLOAT, RCPF, and FIX instructions can produce a rough estimate.


3.5.1 Integer Division


You can implement division on the ’C4x by repeating SUBC, a special
conditional subtract instruction. Consider the case of a 32-bit positive
dividend with i significant bits (and 32-i sign bits), and a 32-bit positive
divisor with j significant bits (and 32-j sign bits). The repetition of the
SUBC command i-j+1 times produces a 32-bit result in which the lower i-j+1 bits
are the quotient, and the upper 31-i+j bits are the remainder of the division.


SUBC implements binary division in the same manner as long division. The
divisor (assumed to be smaller than the dividend) is shifted left i-j times to
align with the dividend. Then, using SUBC, the shifted divisor is subtracted
from the dividend. For each subtract that does not produce a negative answer,
the dividend is replaced by the difference. It is then shifted to the left, and
the LSB is set to 1. If the difference is negative, the dividend is simply
shifted left by one. This operation is repeated i-j+1 times.


As an example, consider the division of 33 by 5 using both long division and
the SUBC method. In this case, i = 6, j = 3, and the SUBC operation is repeated
6-3+1 = 4 times.


LONG DIVISION:

                                   00000000000000000000000000000110   Quotient
00000000000000000000000000000101 ) 00000000000000000000000000100001
                                                             -101
                                                               110
                                                              -101
                                                                 11   Remainder


SUBC METHOD:

  00000000000000000000000000100001   Dividend
  00000000000000000000000000101000   Divisor (aligned)
  Negative difference                (1st SUBC command)

  00000000000000000000000001000010   New Dividend + Quotient
  00000000000000000000000000101000   Divisor
  00000000000000000000000000011010   Difference (>0) (2nd SUBC command)

  00000000000000000000000000110101   New Dividend + Quotient
  00000000000000000000000000101000   Divisor
  00000000000000000000000000001101   Difference (>0) (3rd SUBC command)

  00000000000000000000000000011011   New Dividend + Quotient
  00000000000000000000000000101000   Divisor
  Negative difference                (4th SUBC command)

  00000000000000000000000000110110   Final Result

  In the final result, the lower four bits (0110) are the quotient, and
  the bits above them (11) are the remainder.
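
The SUBC procedure can be traced on a host machine. The following Python sketch models a single SUBC step as described in this section (subtract, then shift left and set the LSB if the difference is not negative; otherwise only shift left). It is a behavioral model for positive operands below 2^31, not a cycle-accurate description of the instruction, and the function names are illustrative.

```python
def subc(src, dst):
    """One conditional-subtract step, modeled on the description of SUBC."""
    diff = dst - src
    if diff >= 0:
        return ((diff << 1) | 1) & 0xFFFFFFFF   # keep difference, quotient bit 1
    return (dst << 1) & 0xFFFFFFFF              # keep dividend, quotient bit 0


def divide(dividend, divisor):
    """Return (quotient, remainder) for positive integers using repeated SUBC."""
    i = dividend.bit_length()       # significant bits of the dividend
    j = divisor.bit_length()        # significant bits of the divisor
    if i < j:
        return 0, dividend          # divisor is larger: quotient is 0
    count = i - j
    aligned = divisor << count      # align divisor with dividend
    acc = dividend
    for _ in range(count + 1):      # i-j+1 repetitions
        acc = subc(aligned, acc)
    quotient = acc & ((1 << (count + 1)) - 1)   # lower i-j+1 bits
    remainder = acc >> (count + 1)              # bits above the quotient
    return quotient, remainder


print(divide(33, 5))   # the worked example above: quotient 6, remainder 3
```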


When the SUBC command is used, both the dividend and the divisor must be
positive. Example 3-7 shows a realization of the integer division in which the
sign of the quotient is properly handled. The last instruction before returning
modifies the condition flag, in case subsequent operations depend on the sign
of the result.


Example 3-7. Integer Division 


*
* TITLE INTEGER DIVISION
*
* SUBROUTINE DIVI
*
* INPUTS: SIGNED INTEGER DIVIDEND IN R0,
*         SIGNED INTEGER DIVISOR IN R1.
*
* OUTPUT: R0/R1 into R0.
*
* REGISTERS USED: R0-R3, IR0, IR1
*
* OPERATION: 1. NORMALIZE DIVISOR WITH DIVIDEND
*            2. REPEAT SUBC
*            3. QUOTIENT IS IN LSBs OF RESULT
*
* CYCLES: 31-62 (DEPENDS ON AMOUNT OF NORMALIZATION)
*
        .globl  DIVI
SIGN    .set    R2
TEMPF   .set    R3
TEMP    .set    IR0
COUNT   .set    IR1
*
* DIVI - SIGNED DIVISION
*
DIVI:
*
* DETERMINE SIGN OF RESULT. GET ABSOLUTE VALUE OF OPERANDS.
*
        XOR     R0,R1,SIGN      ;Get the sign
        ABSI    R0
        ABSI    R1
        CMPI    R0,R1           ;Divisor > dividend ?
        BGTD    ZERO            ;If so, return 0
*
* NORMALIZE OPERANDS. USE DIFFERENCE IN EXPONENTS AS SHIFT COUNT
* FOR DIVISOR, AND AS REPEAT COUNT FOR 'SUBC'.
*
        FLOAT   R0,TEMPF        ;Normalize dividend
        PUSHF   TEMPF           ;PUSH as float
        POP     COUNT           ;POP as int
        LSH     -24,COUNT       ;Get dividend exponent
        FLOAT   R1,TEMPF        ;Normalize divisor
        PUSHF   TEMPF           ;PUSH as float
        POP     TEMP            ;POP as int
        LSH     -24,TEMP        ;Get divisor exponent
        SUBI    TEMP,COUNT      ;Get difference in exponents
        LSH     COUNT,R1        ;Align divisor with dividend
*
* DO COUNT+1 SUBTRACT & SHIFTS.
*
        RPTS    COUNT
        SUBC    R1,R0
*
* MASK OFF THE LOWER COUNT+1 BITS OF R0
*
        SUBRI   31,COUNT        ;Shift count is (32 - (COUNT+1))
        LSH     COUNT,R0        ;Shift left
        NEGI    COUNT
        LSH     COUNT,R0        ;Shift right to get result
*
* CHECK SIGN AND NEGATE RESULT IF NECESSARY.
*
        NEGI    R0,R1           ;Negate result
        ASH     -31,SIGN        ;Check sign
        LDINZ   R1,R0           ;If set, use negative result
        CMPI    0,R0            ;Set status from result
        RETS
*
* RETURN ZERO
*
ZERO:
        LDI     0,R0
        RETS
        .end


If the dividend is less than the divisor and you want fractional division, you can
perform a division after you determine the desired accuracy of the quotient in
bits. If the desired accuracy is k bits, start by shifting the dividend left by
k positions. Then apply the algorithm described above, and replace i with
i + k. It is assumed that i + k is less than 32.


3.5.2 Computation of Floating-Point Inverse and Division 


The RCPF (reciprocal of a floating-point number) instruction generates an
estimate of the reciprocal of a floating-point number. You can use the
Newton-Raphson algorithm to extend the precision of the mantissa of the
estimate that the instruction generates. Floating-point division can then be
obtained by multiplying the dividend by the reciprocal of the divisor.


The input to RCPF is assumed to be v = v(man) x 2^v(exp). The output is
x = x(man) x 2^x(exp). The value v(man) (or x(man)) is composed of three fields:
the sign bit v(sign), an implied nonsign bit, and the fraction field v(frac).


Four rules apply to generating the reciprocal of a floating-point number:


1) If v > 0, then x(exp) = -v(exp) - 1, and x(man) = 2/v(man).
   For the special case in which the 10 MSBs of v(man) = 01.00000000b,
   x(man) = 2 - 2^-8 = 01.11111111b. In both cases, the 23 LSBs of
   x(frac) = 0.


2) If v < 0, then x(exp) = -v(exp) - 1, and x(man) = 2/v(man).
   For the special case in which the 10 MSBs of v(man) = 10.00000000b,
   x(man) = -1 - 2^-8 = 10.11111111b. In both cases, the 23 LSBs of
   x(frac) = 0.


3) If v = 0 (v(exp) = -128), then x(exp) = 127, and
   x(man) = 01.1111111111111111111111111111111b.
   In other words, if v = 0, then x becomes the largest positive number
   representable in the extended-precision floating-point format. The overflow
   flag (V) is set to 1.


4) If v(exp) = 127, then x(exp) = -128, and x(man) = 0.
   The zero flag (Z) is set to 1.


The Newton-Raphson algorithm is:

x[n+1] = x[n](2.0 - vx[n])


In this algorithm, v is the number for which the reciprocal is desired. x[0] is
the seed for the algorithm and is given by RCPF. At every iteration of the
algorithm, the number of bits of accuracy in the mantissa doubles. Using RCPF,
accuracy starts at eight bits. With one iteration, accuracy increases to 16
bits in the mantissa, and with the second iteration, accuracy increases to 32
bits in the mantissa. Example 3-8 shows the program for implementing this
algorithm on the ’C4x.
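
The doubling of accuracy per iteration can be observed on a host machine. The following Python sketch is a numerical illustration only: `rcpf_like_seed` is a hypothetical stand-in that rounds the true reciprocal to eight mantissa bits, and it does not reproduce the exact output of the RCPF instruction.

```python
import math


def rcpf_like_seed(v):
    """Hypothetical 8-bit seed: the reciprocal with its mantissa rounded to 2^-8."""
    m, e = math.frexp(1.0 / v)              # 1/v = m * 2**e, 0.5 <= |m| < 1
    return math.ldexp(round(m * 256) / 256, e)


def reciprocal(v, iterations=2):
    """Refine the seed with x[n+1] = x[n] * (2.0 - v * x[n])."""
    x = rcpf_like_seed(v)
    for _ in range(iterations):
        x = x * (2.0 - v * x)
    return x


for n in range(3):
    rel_err = abs(reciprocal(3.0, n) * 3.0 - 1.0)
    print(n, rel_err)       # the relative error roughly squares at each step
```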
Example 3-8. Inverse of a Floating-Point Number With 32-Bit Mantissa Accuracy


*
* TITLE INVERSE OF A FLOATING-POINT NUMBER WITH 32-BIT
*       MANTISSA ACCURACY
*
* SUBROUTINE INVF
*
* THE FLOATING-POINT NUMBER v IS STORED IN R0. AFTER THE
* COMPUTATION IS COMPLETED, 1/v IS STORED IN R1.
*
* TYPICAL CALLING SEQUENCE:
*       LAJU    INVF
*       LDF     v,R0
*       NOP             <---- can be other non-pipeline-break
*       NOP             <---- instructions
*
* ARGUMENT ASSIGNMENTS:
*
*  ARGUMENT | FUNCTION
*  ---------+------------------------------------------------
*  R0       | v = NUMBER TO FIND THE RECIPROCAL OF
*           | (UPON THE CALL)
*  R1       | 1/v (UPON THE RETURN)
*
* REGISTER USED AS INPUT: R0
* REGISTERS MODIFIED: R1, R2
* REGISTER CONTAINING RESULT: R1
* REGISTER FOR SUBROUTINE CALL: R11
*
* CYCLES: 7 (not including subroutine overhead)
* WORDS:  8 (not including subroutine overhead)
*
        .global INVF
*
INVF:   RCPF    R0,R1           ;Get x[0] = the
*                               ;estimate of 1/v, R0 = v
        MPYF3   R1,R0,R2
        SUBRF   2.0,R2
        MPYF    R2,R1           ;End of first iteration
*                               ;(16 bits accuracy)
        BUD     R11             ;Delayed return to caller
*
        MPYF3   R1,R0,R2
        SUBRF   2.0,R2
        MPYF    R2,R1           ;End of second iteration
*                               ;(32 bits accuracy)
*
* R1 = 1/v, Return to caller
*
        .end
3.6 Calculating a Square Root


In many applications, normalization of data values is necessary. Often, the
normalizing factor is the square root of another quantity. For example, given
a vector, the unit vector in the same direction as the original vector can be
found by normalizing the original vector by its length. This involves a division
by a square root. The ’C4x single-cycle instruction RSQRF generates an
estimate of the reciprocal of the square root of a positive floating-point number.
This estimate has the correct exponent, and the mantissa is accurate to the
eighth binary place (the error of the mantissa is < 2^-8). Three rules apply to
this instruction:


1) If v(exp) is even, then x(exp) = -(v(exp)/2) - 1, and
   x(man) = 2/sqrt(v(man)).
   For the special case where the 10 MSBs of v(man) = 01.00000000b,
   x(man) = 2 - 2^-8 = 01.11111111b. In both cases, the 23 LSBs of
   x(frac) = 0.


2) If v(exp) is odd, then x(exp) = -((v(exp) - 1)/2) - 1, and
   x(man) = sqrt(2/v(man)). The 23 LSBs of x(frac) = 0.


3) If v = 0 (v(exp) = -128), then x(exp) = 127, and
   x(man) = 01.1111111111111111111111111111111b.
   In other words, if v = 0, then x becomes the largest positive number
   representable in the extended-precision floating-point format. The overflow
   flag (V) is set to 1.


If you need greater precision than the RSQRF instruction gives for the estimate
of the reciprocal of the square root, you can use the Newton-Raphson
algorithm to further extend the precision of the mantissa. The algorithm is:


x[n+1] = x[n](1.5 - (v/2) x[n] x[n])


In this equation, v is the number for which the reciprocal of the square root
is desired. x[0] is the seed for the algorithm and is given by RSQRF. At every
iteration of the algorithm, the number of bits of accuracy in the mantissa
doubles. Using RSQRF, accuracy starts at eight bits. With one iteration,
accuracy increases to 16 bits, and with the second iteration, accuracy
increases to 32 bits in the mantissa. Example 3-9 shows the program for
implementing this algorithm on the ’C4x.
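
The iteration for the reciprocal square root, and the final multiplication that yields the square root itself, can be tried on a host machine. In the following Python sketch, `rsqrf_like_seed` is a hypothetical stand-in that rounds the true value to eight mantissa bits; it is not a model of the RSQRF instruction's exact output.

```python
import math


def rsqrf_like_seed(v):
    """Hypothetical 8-bit seed for 1/sqrt(v), mantissa rounded to 2^-8."""
    m, e = math.frexp(1.0 / math.sqrt(v))
    return math.ldexp(round(m * 256) / 256, e)


def reciprocal_sqrt(v, iterations=2):
    """Refine the seed with x[n+1] = x[n] * (1.5 - (v/2) * x[n] * x[n])."""
    x = rsqrf_like_seed(v)
    half_v = 0.5 * v                    # computed once, as in MPYF 0.5,R0
    for _ in range(iterations):
        x = x * (1.5 - half_v * x * x)
    return x


v = 9.0
x = reciprocal_sqrt(v)
print(x)        # approximately 1/3
print(v * x)    # sqrt(v) = v * x[n], approximately 3
```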
Example 3-9. Reciprocal of the Square Root of a Positive Floating Point


*
* TITLE RECIPROCAL OF THE SQUARE ROOT OF A POSITIVE
*       FLOATING-POINT NUMBER
*
* SUBROUTINE RCPSQRF
*
* THE FLOATING-POINT NUMBER v IS STORED IN R0. AFTER THE
* COMPUTATION IS COMPLETED, 1/SQRT(v) IS STORED IN R1.
*
* TYPICAL CALLING SEQUENCE:
*       LDF     v,R0
*       LAJU    RCPSQRF
*
* ARGUMENT ASSIGNMENTS:
*
*  ARGUMENT | FUNCTION
*  ---------+------------------------------------------------
*  R0       | v = NUMBER TO FIND THE RECIPROCAL OF
*           | (UPON THE CALL)
*  R1       | 1/sqrt(v) (UPON THE RETURN)
*
* REGISTER USED AS INPUT: R0
* REGISTERS MODIFIED: R1, R2
* REGISTER CONTAINING RESULT: R1
* REGISTER FOR SUBROUTINE CALL: R11
*
* CYCLES: 10 (not including subroutine overhead)
* WORDS:  10 (not including subroutine overhead)
*
        .global RCPSQRF
*
RCPSQRF: RSQRF  R0,R1           ;Get x[0] = the estimate of 1/sqrt(v), R0 = v
        MPYF    0.5,R0          ;R0 = v/2
*
        MPYF3   R1,R1,R2        ;First iteration
        MPYF    R0,R2
        SUBRF   1.5,R2
        MPYF    R2,R1           ;End of first iteration (16 bits accuracy)
*
        MPYF3   R1,R1,R2        ;Second iteration
*
        BUD     R11             ;Delayed return to caller
*
        MPYF    R0,R2
        SUBRF   1.5,R2
        MPYF    R2,R1           ;End of second iteration (32 bits accuracy)
*
* R1 = 1/SQRT(v), Return to caller
*
        .end


You can find the square root by a simple multiplication: sqrt(v) = vx[n] in which 
x[n] is the estimate of 1/sqrt(v) as determined by the Newton-Raphson algo- 
rithm or another algorithm. 


3.7 Extended-Precision Arithmetic 


The ’C4x offers 32 bits of precision for integer arithmetic and 24 bits of
precision in the mantissa for floating-point arithmetic. For higher precision
in floating-point operations, the twelve extended-precision registers, R0
to R11, contain eight more bits of accuracy. Because no comparable extension
is available for fixed-point arithmetic, this section discusses how to achieve
fixed-point double precision. The technique consists of performing the
arithmetic by parts and is similar to the way in which longhand arithmetic is done.


The instructions ADDC (add with carry) and SUBB (subtract with borrow) use
the status carry bit for extended-precision arithmetic. The carry bit is affected
by the arithmetic operations of the ALU and by the rotate and shift instructions.
You can also manipulate it directly by setting the status register to certain
values. For proper operation, the overflow mode bit should be reset (OVM = 0)
so that the results are not replaced with the saturation values.
Example 3-10 and Example 3-11 show 64-bit addition and 64-bit subtraction,
respectively. The first operand is stored in the registers R0 (low word) and R1
(high word). The second operand is stored in registers R2 (low word) and R3
(high word). The result is stored in R0 and R1.


Example 3-10. 64-Bit Addition 


*
* TITLE 64-BIT ADDITION
*
* TWO 64-BIT NUMBERS ARE ADDED TO EACH OTHER PRODUCING
* A 64-BIT RESULT. THE NUMBERS X (R1,R0) AND Y (R3,R2)
* ARE ADDED, RESULTING IN W (R1,R0).
*
*         R1 R0
*       + R3 R2
*       -------
*         R1 R0
*
        ADDI    R2,R0
        ADDC    R3,R1
Example 3-11. 64-Bit Subtraction


*
* TITLE 64-BIT SUBTRACTION
*
* TWO 64-BIT NUMBERS ARE SUBTRACTED FROM EACH OTHER
* PRODUCING A 64-BIT RESULT. THE NUMBERS X (R1,R0) AND
* Y (R3,R2) ARE SUBTRACTED, RESULTING IN W (R1,R0).
*
*         R1 R0
*       - R3 R2
*       -------
*         R1 R0
*
        SUBI    R2,R0
        SUBB    R3,R1
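
The way the carry (or borrow) links the low-word and high-word operations can be modeled on a host machine. The following Python sketch treats each register as an unsigned 32-bit value; the function names and argument order are illustrative only.

```python
MASK32 = 0xFFFFFFFF


def add64(x_lo, x_hi, y_lo, y_hi):
    """64-bit add by parts: low words first, then high words plus carry."""
    lo = x_lo + y_lo                        # like ADDI R2,R0 (sets carry)
    carry = lo >> 32
    hi = (x_hi + y_hi + carry) & MASK32     # like ADDC R3,R1
    return lo & MASK32, hi


def sub64(x_lo, x_hi, y_lo, y_hi):
    """64-bit subtract by parts: low words first, then high words minus borrow."""
    lo = x_lo - y_lo                        # like SUBI R2,R0 (sets borrow)
    borrow = 1 if lo < 0 else 0
    hi = (x_hi - y_hi - borrow) & MASK32    # like SUBB R3,R1
    return lo & MASK32, hi


print(add64(0xFFFFFFFF, 0, 1, 0))   # carry propagates into the high word
print(sub64(0, 1, 1, 0))            # borrow is taken from the high word
```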


When two 32-bit numbers are multiplied, a 64-bit product results. To support
this, the ’C4x provides a 32-bit x 32-bit multiplier and two special
instructions, MPYSHI (multiply signed integer and produce 32 MSBs) and MPYUHI
(multiply unsigned integer and produce 32 MSBs). Example 3-12 shows the
implementation of a 32-bit x 32-bit multiplication.


Example 3-12. 32-Bit by 32-Bit Multiplication


*
* TITLE 32-BIT x 32-BIT MULTIPLICATION
*
* MULTIPLIES TWO 32-BIT NUMBERS, PRODUCING A 64-BIT RESULT.
* THE TWO NUMBERS R0 AND R1 ARE MULTIPLIED, RESULTING
* IN W (R3,R2).
*
*            R0
*       x    R1
*       -------
*         R3 R2
*
        MPYI3   R0,R1,R2
        MPYSHI3 R0,R1,R3
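
The split of the 64-bit product into a low word and a high word can be checked on a host machine. The following Python sketch models a signed 32-bit x 32-bit multiplication and returns the two halves as unsigned 32-bit register images; the helper names are illustrative.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(x):
    """Interpret a 32-bit register image as a twos-complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def mul32x32(r0, r1):
    """Return (high word, low word) of the signed 64-bit product."""
    product = to_signed32(r0) * to_signed32(r1)
    lo = product & MASK32               # 32 LSBs, as produced by MPYI3
    hi = (product >> 32) & MASK32       # 32 MSBs, as produced by MPYSHI3
    return hi, lo


print([hex(w) for w in mul32x32(0x00010000, 0x00010000)])   # 2^16 * 2^16 = 2^32
print([hex(w) for w in mul32x32(0xFFFFFFFF, 2)])            # -1 * 2 = -2
```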
3.8 Floating-Point Format Conversion: IEEE to/From ’C4x


In fixed-point arithmetic, the binary point that separates the integer from the
fractional part of the number is fixed at a certain location. Therefore, if the
binary point of a 32-bit number is fixed after the most significant bit (which is
also the sign bit), only a fractional number (a number with an absolute value
less than 1) can be represented. In other words, the number has 31 fractional
bits. All operations assume that the binary point is fixed at this location.
The fixed-point system, although simple to implement in hardware, imposes
limitations on the dynamic range of the represented number. This causes
scaling problems in many applications. You can avoid this difficulty by using
floating-point numbers.


A floating-point number consists of a mantissa m multiplied by base b raised
to an exponent e:


m x b^e


In current hardware implementations, the mantissa is typically a normalized
number with an absolute value between 1 and 2, and the base is b = 2.
Although the mantissa is represented as a fixed-point number, the actual value
of the overall number floats the binary point because of the multiplication by
b^e. The exponent e is an integer whose value determines the position of the
binary point in the number. IEEE has established a standard format for the
representation of floating-point numbers.


To achieve higher efficiency in the hardware implementation, the ’C4x uses a
floating-point format that differs from the IEEE standard. However, the ’C4x has
two single-cycle instructions, TOIEEE and FRIEEE, for the format conversion.
These two instructions can also be used in parallel with the STF instruction,
which allows the data format to be converted within a memory-to-memory
transfer. Here are descriptions of both formats and an example program to
convert between them.


’C4x floating-point format:

   8 bits    1 bit    23 bits
+----------+-------+-----------+
|    e     |   s   |     f     |
+----------+-------+-----------+


In a 32-bit word representing a floating-point number, the first 8 bits
correspond to the exponent expressed in twos-complement format. One bit is for
sign, and 23 bits are for the mantissa. The mantissa is expressed in
twos-complement form with the binary point after the most significant nonsign
bit. Because this bit is the complement of the sign bit s, it is suppressed. In
other words, the mantissa actually has 24 bits. One special case occurs when
e = -128. In this case, the number is interpreted as zero, independently of the
values of s and f (which are, by default, set to zero). To summarize, the values
of the represented numbers in the ’C4x floating-point format are as follows:


2^e * (01.f)    if s = 0
2^e * (10.f)    if s = 1
0               if e = -128


IEEE floating-point format:

  1 bit    8 bits     23 bits
+-------+----------+-----------+
|   s   |    e     |     f     |
+-------+----------+-----------+


The IEEE floating-point format uses sign-magnitude notation for the mantissa.
In a 32-bit word representing a floating-point number, the first bit is the sign bit.
The next 8 bits correspond to the exponent, expressed in an offset-by-127
format (the actual exponent is e-127). The following 23 bits represent the
absolute value of the mantissa with the most significant 1 implied. The binary
point is fixed after this most significant 1. In other words, the mantissa
actually has 24 bits. Several special cases are summarized below.


These are values of the represented numbers in the IEEE floating-point
format:


(-1)^s * 2^(e-127) * (01.f)     if 0 < e < 255

Special cases:

(-1)^s * 0.0                    if e = 0 and f = 0 (zero)
(-1)^s * 2^(-126) * (0.f)       if e = 0 and f <> 0 (denormalized)
(-1)^s * infinity               if e = 255 and f = 0 (infinity)
NaN (not a number)              if e = 255 and f <> 0
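
The two bit layouts can be compared on a host machine by decoding the same value from each format. The following Python sketch decodes a 32-bit ’C4x word according to the field definitions given above and uses the standard `struct` module for the IEEE single-precision word. It is a decoding aid written for illustration, not a model of the TOIEEE or FRIEEE instructions, and the function names are hypothetical.

```python
import struct


def c4x_to_float(word):
    """Decode a 32-bit 'C4x word: exponent (8), sign (1), fraction (23)."""
    e = (word >> 24) & 0xFF
    if e >= 128:
        e -= 256                    # exponent is twos complement
    if e == -128:
        return 0.0                  # special case: zero
    s = (word >> 23) & 1
    f = (word & 0x7FFFFF) / float(1 << 23)
    mantissa = (-2.0 + f) if s else (1.0 + f)   # 10.f or 01.f
    return mantissa * 2.0 ** e


def ieee_to_float(word):
    """Decode a 32-bit IEEE single-precision word."""
    return struct.unpack(">f", struct.pack(">I", word))[0]


# The same values have different bit patterns in the two formats.
print(c4x_to_float(0x00000000), ieee_to_float(0x3F800000))   # both 1.0
print(c4x_to_float(0xFF800000), ieee_to_float(0xBF800000))   # both -1.0
print(c4x_to_float(0x01400000), ieee_to_float(0x40400000))   # both 3.0
```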


The ’C4x performs the conversion according to these definitions of the
formats. It assumes that the source data for the IEEE format is in memory only
and that the source data for the ’C4x floating-point format is in either memory
or an extended-precision register. The destination for both conversions must
be an extended-precision register. In the case of a block memory transfer, the
data-format conversion can be executed without penalty by a parallel
instruction with STF. Example 3-13 and Example 3-14 show the data-format
conversion during a data transfer between a communication port and internal RAM.
Example 3-13. IEEE to ’C4x Conversion Within Block Memory Transfer


*
* TITLE IEEE TO ’C4x CONVERSION WITHIN BLOCK MEMORY
*       TRANSFER
*
* PROGRAM ASSUMES THAT INPUT FIFO OF COMMUNICATION PORT 0
* IS FULL OF IEEE FORMAT DATA. EIGHT DATA WORDS ARE
* TRANSFERRED FROM COMMUNICATION PORT 0 TO INTERNAL RAM
* BLOCK 0 AND THE DATA FORMAT IS CONVERTED FROM IEEE FORMAT
* TO ’C4x FLOATING-POINT FORMAT.
*
        LDI     @CP0_IN,AR0     ;Load comm port0 input FIFO address
        LDI     @RAM0,AR1       ;Load internal RAM block 0 address
        FRIEEE  *AR0,R0         ;Convert first data
        RPTS    6
        FRIEEE  *AR0,R0         ;Convert next data
||      STF     R0,*AR1++(1)    ;Store previous data
        STF     R0,*AR1++(1)    ;Store last data


Example 3-14. ’C4x to IEEE Conversion Within Block Memory Transfer


*
* TITLE ’C4x TO IEEE CONVERSION WITHIN BLOCK MEMORY
*       TRANSFER
*
* PROGRAM ASSUMES THAT OUTPUT FIFO OF COMMUNICATION PORT 0
* IS EMPTY. EIGHT DATA WORDS ARE TRANSFERRED FROM INTERNAL
* RAM BLOCK 0 TO COMMUNICATION PORT 0 AND THE DATA FORMAT
* IS CONVERTED FROM ’C4x FLOATING-POINT FORMAT TO
* IEEE FORMAT.
*
        LDI     @CP0_OUT,AR0    ;Load comm port0 output FIFO address
        LDI     @RAM0,AR1       ;Load internal RAM block 0 address
        TOIEEE  *AR1++(1),R0    ;Convert first data
        RPTS    6
        TOIEEE  *AR1++(1),R0    ;Convert next data
||      STF     R0,*AR0         ;Store previous data
        STF     R0,*AR0         ;Store last data
Memory Interfacing


The ’C4x’s advanced interface design can be used to implement a wide variety 
of system configurations. Its two external buses and DMA capability provide 
a flexible parallel 32-bit interface to byte- or word-wide devices. 


This chapter describes how to use the ’C4x’s memory interfaces to connect to 
various external devices. Specific discussions include implementation of a 
parallel interface to devices with and without wait states and implementing 
system control functions. 
4.1 System Configuration


Figure 4-1 illustrates an expanded configuration of a ’C4x system with
different types of external devices and the interfaces to which they are connected.


Figure 4-1. Possible System Configurations


[Figure: block diagram of a ’C4x and the devices attached to each interface.
Labels in the figure: fast local and large shared devices on the local bus and
global bus; peripherals on the interrupt interface; bit I/O on the external
flags; I/O devices and peripherals on the communication ports and timer
interface; clock, reset generator, etc. on system control.]

In your design, you can use any subset or superset of the illustrated compo- 
nents. 
4.2 External Interfacing


The ’C4x interfaces connect to a wide variety of device types. Each of these
interfaces is tailored to a particular type of device such as memory, DMA,
parallel and serial peripherals, and I/O. In addition, ’C4x devices can
interface directly with each other, without external logic, through their
communication ports or their external flag pins IIOF(0-3). Each interface
comprises one or more signal lines, which transfer information and control its
operation. Figure 4-2 shows the signal groups for these interfaces.


Figure 4-2. External Interfaces


[Figure: signal groups of the ’C4x external interfaces]

Global bus:     data, address, data enable, address enable, status,
                interlock signal, STRB0 control, STRB0 control enable,
                STRB1 control, STRB1 control enable
Local bus:      data, address, data enable, address enable, status,
                interlock signal, LSTRB0 control, LSTRB0 control enable,
                LSTRB1 control, LSTRB1 control enable
Communication port interface (6 sets):
                CnD(7-0), CREQn, CACKn, CSTRBn, CRDYn
Timer interface and I/O flags:
                TCLK0, TCLK1
Master clock:   X1, X2/CLKIN
Emulation interface:
                TCK, TDO, TDI, TMS, TRST, EMU0, EMU1
Other groups:   interrupt and I/O flags, nonmaskable interrupt, interrupt
                acknowledge, reset and ROM control, clock outputs


n = 0 for communication port 0, n = 1 for communication port 1, etc.


The global and local buses implement the primary memory-mapped interfaces 
to the device. These interfaces allow external devices such as DMA controllers 
and other microprocessors to share resources with one or more ’C4x devices 
through a common bus. 
4.3 Global and Local Bus Interfaces 


The ’C4x uses the global and local buses to access the majority of its 
memory-mapped locations. Since these two memory interfaces are identical 
in every way, except for their positions in the memory map, each example in 
this memory interface section focuses on only one of the two interfaces. How- 
ever, all of the examples are applicable to either the local or global bus. The 
buses have identical but mutually exclusive sets of control signals: 


Table 4—1. Local/Global Bus Control Signals 


Global Bus      Local Bus
----------      ---------
STRB0           LSTRB0
STRB1           LSTRB1
CE0             LCE0
CE1             LCE1
RDY0            LRDY0
RDY1            LRDY1
AE              LAE
DE              LDE
PAGE0           LPAGE0
PAGE1           LPAGE1
R/W0            LR/W0
R/W1            LR/W1


While both the global bus and the local bus can interface to a wide variety of 
devices, they most commonly interface to memories. 
4.4 Zero Wait-State Interfacing to RAMs 


A memory-read access time is normally defined as the time between address 
valid and data valid. This time can be determined by: 


$$\text{Read access time} = t_{c(H)} - \left(t_{d(H1L\text{-}A)} + t_{su(D)R}\right)$$


where:

$t_{c(H)}$ = H1/H3 cycle time

$t_{d(H1L\text{-}A)}$ = H1 low to address valid

$t_{su(D)R}$ = Data valid before next H1 low (read)


For a full-speed, zero wait-state interface to any device, a 50-MHz ’C4x (40-ns 
instruction cycle time) requires a read access time of 21 ns from address stable 
to data valid. For most memories, the access time from chip enable is the same 
as access time from address; thus, it is possible to use 20-ns memories at full 
speed with a 50-MHz ’C4x. However, to use 20-ns memories properly, you 
must avoid long delays between the processor and the memories. 
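
The read access-time budget described above is simple arithmetic and can be
checked with a short script. The following Python sketch is illustrative
only: the function name is hypothetical, and the 9-ns and 10-ns values used
for H1 low to address valid and for read data setup are assumptions chosen
to reproduce the 21-ns figure quoted above. Take the actual values from the
data sheet.

```python
def read_access_time_ns(tc_h, td_h1l_a, tsu_d_r):
    """Address-valid to data-valid time available to a zero-wait-state device.

    tc_h      -- H1/H3 cycle time, in ns
    td_h1l_a  -- H1 low to address valid, in ns
    tsu_d_r   -- data valid before next H1 low (read), in ns
    """
    return tc_h - (td_h1l_a + tsu_d_r)


# 50-MHz 'C4x: 40-ns H1 cycle. The 9-ns and 10-ns inputs are assumed
# placeholder values, not data-sheet figures.
available = read_access_time_ns(40, 9, 10)
print("time available to the memory (ns):", available)
print("20-ns memory fits with no added delay:", available >= 20)
```

Any interconnect or chip-enable gating delay reduces the available time
further, which is why the text warns against long delays between the
processor and the memories.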


Avoiding these delays is not always possible, because interconnections and 
gating for chip-enable generation can cause them. In addition, if you choose 
a memory device with an output enable, the output enable must become active 
quickly enough to ensure that the memory can meet the data valid timing 
requirements of the ’C4x. For memories with 20-ns access times, the output 
enable active to data valid timing parameter is typically less than 10 ns. 


Currently available RAMs without output-enable (OE) control lines include the 
1-bit wide organized RAMs and most of the 4-bit wide RAMs. Those with OE 
controls include the byte-wide and a few of the 4-bit wide RAMs. Many of the 
fastest RAMs do not provide OE control; they use chip-enable (CE) controlled 
write cycles to ensure that data outputs do not turn on for write operations. In 
CE-controlled write cycles, the write control line (WE) goes low before CE goes 
low, and internal logic holds the outputs disabled until the cycle is completed. 
Using CE-controlled write cycles is an efficient way to interface fast RAMs 
without OE controls to the ’C4x at full speed. 


Note: 


You can find timing parameters for CLKIN, H1, H3, and memory in the 
TMS320C40 and TMS320C44 data sheets. 
4.4.1 Consecutive Reads Followed by a Write Interface Timing 


Figure 4-3 shows the timing of consecutive reads followed by a write. For
consecutive reads, LSTRB0 stays active (low), and LR/W stays high as long as
read cycles continue. For back-to-back reads, the ’C4x requires
zero-wait-state memories to have an address-valid to data-valid time of less
than 21 ns.


For most memory devices, this time is the same as the memory access time,
which is t1 = 20 ns. Thus, memories with access times of 25 ns or more cannot
meet this timing.


Memory device timing is not as critical for zero-wait-state as for
nonzero-wait-state write cycles, because ’C4x writes take two H1 cycles. The
extra cycle gives LSTRB0 enough time to frame LR/W, preventing memories that
go into high impedance slowly at the end of a read cycle from driving the bus
during the subsequent write cycle. For the memory device used in this design
(Figure 4-3), the data lines are guaranteed to go into high impedance
(t2 = 10 ns) after CS goes inactive, which gives more than 23 ns of margin
before the ’C4x starts driving the bus with write data. Also, the extra cycle
with LSTRB0 inactive prevents writes to random locations in memory while the
address is changing between consecutive writes.


For the write cycles shown in Figure 4-3 and Figure 4-4, the RAM requires
15 ns of write data setup before CS goes high, and this design provides at
least 24 ns (t3). A data hold time of 0 ns (t4) is required by the RAM, and
this design provides greater than 13 ns. Finally, the RAM’s 20-ns setup and
0-ns hold times for address (with respect to CS high) are met with a clear
margin.


Figure 4—3. Consecutive Reads Followed by a Write 


Figure 4-4. Consecutive Writes Followed by a Read 

[Timing diagram labels: LD(31-0) carries valid write data, valid write data,
then valid data; LA(30-0) carries valid write address, valid write address,
then the read address.]


4.4.2 Consecutive Writes Followed by a Read Interface Timing 


Figure 4-4 shows the timing of consecutive writes followed by a read. Notice
that between consecutive writes, LR/W stays low, but LSTRB0 goes inactive to
frame the write cycles. Although ’C4x zero-wait-state writes take two H1
cycles, writes appear to take one cycle internally (from the perspective of
the CPU and DMA) if no access to the interface is already in progress.


In the read cycle following the writes in Figure 4-4, the ’C4x requires
zero-wait-state memories to have an LSTRB-active to data-valid time of less
than 21 ns (one H1 cycle minus the sum of H1 low to LSTRB active and data
setup before H1 low). For most memory devices, this time is the same as the
memory access time, which is t1 = 20 ns in this design. Thus, a margin of
only 1 ns exists, leaving little allowance for STRB gating if desired.
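
The effect of strobe gating on this margin can be estimated as follows. This
is a minimal Python sketch, not part of the original design: the function
name is hypothetical, and the 5-ns gate delay in the second call is an
assumed value used only to show how quickly the 1-ns margin is consumed.

```python
def read_margin_ns(available_ns, memory_access_ns, gating_delay_ns=0):
    """Slack left after the memory access time and any strobe gating delay.

    A negative result means the read timing requirement is violated.
    """
    return available_ns - (memory_access_ns + gating_delay_ns)


# Figures quoted in the text: 21 ns available, 20-ns memory access time.
print("margin without gating (ns):", read_margin_ns(21, 20))

# Assumed 5-ns gate inserted in the LSTRB path (hypothetical value).
print("margin with a 5-ns gate (ns):", read_margin_ns(21, 20, 5))
```

A negative margin indicates that either a faster memory, a faster gate, or
an added wait state is needed.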


4.4.3 RAM Interface Using One Local Strobe 


Figure 4-5 shows the ’C4x’s local bus interfaced to eight Integrated Device
Technology IDT71258 20-ns 64K x 4-bit CMOS static RAMs with zero wait
states using chip-enable controlled write cycles. The SRAMs are arranged to
implement the first 64K 32-bit words in external memory, located at addresses
00000h through 0FFFFh (internal ROM is assumed to be disabled). If these 64K
words of SRAM are the only memory controlled by LSTRB0, the LSTRB ACTIVE
field of the local memory interface control register (LMICR) should be set
to its minimum value of 01111₂, allowing LSTRB0 to be active for only the
first 64K words of the ’C4x’s memory space. In addition, if this memory is
the only memory interfaced to LSTRB0, LSTRB0 requires only one page, and the
PAGESIZE field of the LMICR should be set to 01111₂. Also note that in
Figure 4-5, the LRDY0 input is tied low, selecting zero wait states for all
LSTRB0 accesses on the local bus. With all of the zero-wait-state memory
controlled by LSTRB0, LSTRB1 can be used to control accesses to slower
read-only memory devices or other types of memory.
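
The two field values quoted in this chapter (01111₂ for 64K words here, and
0Ch for 8K-word pages in subsection 4.5.6) are both consistent with a size
of 2^(field + 1) words. The Python sketch below encodes that relationship.
The relationship is inferred from those two data points only, the function
names are hypothetical, and the legal range of each field is not checked
beyond fitting in five bits, so verify the result against the user's guide
before programming a register.

```python
def field_for_size(words):
    """Return the 5-bit field value for a power-of-two size in words.

    Assumes size == 2 ** (field + 1), as inferred from the examples in
    this chapter.
    """
    if words < 2 or words & (words - 1):
        raise ValueError("size must be a power of two of at least 2 words")
    field = words.bit_length() - 2
    if field > 0b11111:
        raise ValueError("size does not fit in a 5-bit field")
    return field


def size_for_field(field):
    """Inverse of field_for_size: size in words for a 5-bit field value."""
    if not 0 <= field <= 0b11111:
        raise ValueError("field must be a 5-bit value")
    return 1 << (field + 1)


print(format(field_for_size(64 * 1024), "05b"))  # field for 64K words
print(format(field_for_size(8 * 1024), "02X"))   # field for 8K words
```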


Figure 4—5. ’C4x Interface to Eight Zero-Wait-State SRAM 




In this circuit implementation, no external logic is necessary to interface the 
’C4x to the memory device. Typically, memory devices must be held inactive 
(CS inactive) during changes in WE; this avoids undesired memory accesses 
while the address changes. The ’C4x ensures this glueless interface because 
LSTRB always frames changes in LR/W. 


4.4.4 RAM Interface Using Both Local Strobes 


Figure 4-6 shows the ’C4x’s local bus interfaced to HM6708 20-ns 64K x
4-bit CMOS static RAMs with zero wait states using CS-controlled write
cycles. These RAMs are arranged to allow 128K 32-bit words of local memory,
which are implemented as two 64K x 32-bit banks. One bank is controlled by
each of the two sets of control signals on the local bus. To map these memory
devices properly in the ’C4x’s memory space, you must use the
local-memory-interface control register (LMICR) to define which part of the
local bus’s memory space is mapped to each of the two strobes. In this
implementation with internal ROM disabled, LSTRB0 is mapped to the first 64K
words of the local space (addresses 0h through 0FFFFh), and LSTRB1 is mapped
to the rest of the local space (addresses 10000h through 7FFF FFFFh). For
this memory configuration, the LSTRB ACTIVE field of the LMICR should be set
to 01111₂. Also, each LSTRB requires only one page, so the PAGESIZE field of
the LMICR should be set to 01111₂. Note that in Figure 4-6, the LRDY inputs
are tied low, selecting zero wait states for all accesses on the local bus.


Hence, through the use of the ’C4x’s four strobes (two each on the local and
global buses), four different banks of memory can be decoded. In addition,
you can change the address decoding under program control by changing the
LSTRB ACTIVE field (bits 24-28) of the LMICR or the
global-memory-interface control register (GMICR). If you must decode more
than four banks of memory or if the chosen memory device cannot meet the read
cycle timing requirements for the ’C4x at zero wait states, you should use
page switching (discussed in subsection 4.5.6) to add an extra cycle to read
accesses outside the current bank boundary.
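
The strobe mapping described in this subsection can be modeled in a few
lines. The Python sketch below is an illustration under stated assumptions:
the bit position of the LSTRB ACTIVE field (bits 24-28) comes from the text,
the boundary of 2^(field + 1) words is inferred from the 01111₂ = 64K example,
offsets are measured from the start of the local space with internal ROM
disabled, and all function and constant names are hypothetical.

```python
LSTRB_ACTIVE_SHIFT = 24                      # field occupies bits 24-28
LSTRB_ACTIVE_MASK = 0x1F << LSTRB_ACTIVE_SHIFT


def set_lstrb_active(reg_value, field):
    """Return a 32-bit register value with the LSTRB ACTIVE field replaced."""
    if not 0 <= field <= 0x1F:
        raise ValueError("field must be a 5-bit value")
    cleared = reg_value & ~LSTRB_ACTIVE_MASK & 0xFFFFFFFF
    return cleared | (field << LSTRB_ACTIVE_SHIFT)


def strobe_for_offset(offset, field):
    """Name the local strobe that covers a word offset into the local space."""
    boundary = 1 << (field + 1)              # inferred: 01111b -> 64K words
    return "LSTRB0" if offset < boundary else "LSTRB1"


print(hex(set_lstrb_active(0x00000000, 0b01111)))
print(strobe_for_offset(0x0FFFF, 0b01111))   # last word of the first bank
print(strobe_for_offset(0x10000, 0b01111))   # first word of the second bank
```

Only the LSTRB ACTIVE field is modified; the other register fields are
passed through unchanged, so the function can be applied to a value read
back from the control register.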
Figure 4-6. ’C4x Interface to Zero-Wait-State SRAMs, Two Strobes 


4.5 Wait States and Ready Generation 


Using wait states can greatly increase a system’s flexibility and reduce its 
hardware requirement. The ’C4x is capable of generating wait states on either 
the global bus or the local bus, and both buses have independent sets of ready 
control logic. The buses’ wait-state configuration is determined by the SWW 
and WTCNT fields of the local and global-bus-interface control registers. 


This section discusses ready generation from the perspective of the
global-bus interface; however, wait-state operation on the local bus is the
same as on the global bus, so this discussion pertains equally well to both.
Also, the local and global buses each have two sets of control signals
(R/W0, STRB0, RDY0, PAGE0, CE0 and R/W1, STRB1, RDY1, PAGE1, CE1), with each
set of control signals having its own ready signal, providing more
flexibility in support of external devices with different speeds. Since both
strobes’ ready signals share the same electrical characteristics, the
following discussion focuses on one of the global bus’s sets of control
signals.


Wait states are generated by: 

- The internal wait-state generator 
- The external ready inputs (RDY0 or RDY1) 
- The logical AND or OR of the two ready signals 


When enabled, internally generated wait states affect all external cycles, re- 
gardless of the address accessed. If different numbers of wait states are re- 
quired for various external devices, the external RDY input can be used to cus- 
tomize wait-state generation to specific system requirements. 


If the logical OR (electrical AND, since the signals are true low) of the
external and wait-count ready signals is selected, the earlier of the two
signals generates a ready condition and allows the cycle to be completed. It
is not required that both signals be present.




4.5.1 ORing of the Ready Signals (STRBx SWW = 10) 


You can use the OR of the two ready signals to implement wait states for de- 
vices that require more wait states than internal logic can implement (up to 
seven). This feature is useful, for example, if a system contains some fast and 
some slow devices. In this case: 


- Fast devices can generate ready externally with a minimum of logic.
  When fast devices are accessed, the external hardware responds promptly
  with ready, which terminates the cycle.

- Slow devices can use the internal wait counter for larger numbers of wait
  states. When slow devices are accessed, the external hardware does not
  respond, and the cycle is appropriately terminated after the internal wait
  count.


The OR of the two ready signals can also terminate the bus cycle before the 
number of wait states implemented with external logic allows termination. In 
this case, a shorter wait count is specified internally than the number of wait 
states implemented with the external ready logic, and the bus cycle is termi- 
nated after the wait count. Also, this feature can be used as a safeguard 
against inadvertent accesses to nonexistent memory that would never re- 
spond with ready and would, therefore, lock up the ’C4x. 


If the OR of the two ready signals is used, however, and the internal wait-state 
count is less than the number of wait states implemented externally, the 
external ready generation logic must be able to reset its sequencing to allow 
a new cycle to begin immediately following the end of the internal wait count. 
Also, the consecutive cycles must be from independently decoded areas of 
memory (or from different pages in memory). Otherwise, the external ready 
generation logic may lose synchronization with bus cycles and generate 
improperly timed wait states. 


4.5.2 ANDing of the Ready Signals (STRBx SWW = 11) 
If the logical AND (electrical OR) of the wait count and external ready
signals is selected, the later of the two signals will control the internal
ready signal, but both signals must be asserted. Accordingly, external ready
control must be implemented for each wait-state device, and the wait count
ready signal must be enabled.


This feature is useful if devices in a system are equipped to provide a ready
signal but cannot respond quickly enough to meet the ’C4x’s timing
requirements. If these devices normally indicate a ready condition and, when
accessed, respond with a wait until they become ready, the logical AND of the
two ready signals can be used to save hardware in the system. In this case,
the internal wait counter can provide wait states initially, and then the
external ready can provide wait states after the external device has had time
to send a not-ready indication. The internal wait counter then remains ready
until the external device also becomes ready, which terminates the cycle.


Additionally, the AND of the two ready signals can be used for extending the 
number of wait states for devices that already have external ready logic imple- 
mented, but require additional wait states under certain unique circumstances. 
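
The difference between the two combining modes can be summarized in a small
behavioral model. The Python sketch below is a simplification for
illustration: it models only the SWW = 10 and SWW = 11 settings named in the
subsection headings, counts time in whole wait states, and uses a
hypothetical function name. A device that never asserts ready is represented
by math.inf.

```python
import math


def wait_states(sww, internal_count, external_ready_after):
    """Wait states inserted before the bus cycle completes.

    sww                  -- 0b10 (OR of ready signals) or 0b11 (AND)
    internal_count       -- wait states programmed in the internal counter
    external_ready_after -- wait states before external RDY goes active,
                            or math.inf if the device never responds
    """
    if sww == 0b10:   # OR: the earlier of the two signals ends the cycle
        return min(internal_count, external_ready_after)
    if sww == 0b11:   # AND: both must be asserted, so the later one ends it
        return max(internal_count, external_ready_after)
    raise ValueError("only SWW = 10 and SWW = 11 are modeled in this sketch")


# OR mode: a fast device answers at once; a missing device is cut off
# by the internal count instead of locking up the processor.
print(wait_states(0b10, 7, 0))
print(wait_states(0b10, 7, math.inf))

# AND mode: the internal count covers the time the device needs to
# deassert ready, then the external signal stretches the cycle.
print(wait_states(0b11, 2, 5))
```

In AND mode a device that never responds yields an unbounded result, which
matches the requirement that external ready control be implemented for each
wait-state device.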


4.5.3 External Ready Generation 


The optimum technique for implementing external ready generation hardware 
depends on the specific characteristics of the system, including the relative 
number of wait-state and nonwait-state devices in the system and the 
maximum number of wait states required for any one device. The approaches 
discussed here are intended to be general enough for most applications and 
are easily modified to accommodate many different system configurations. 


In general, ready generation involves the following three functions: 
1) Segmentation of the address space to distinguish fast and slow devices 
2) Generation of properly timed ready indications 
3) Logical ORing of all the separate ready timing signals together to 
   connect to the physical ready input 


Segmentation of the address space is required to obtain a unique indication 
of each particular area within the address space that requires wait states. This 
segmentation is commonly implemented in the form of chip-select generation. 
Chip-select signals can initiate wait states in many cases; however, 
occasionally, chip-select decoding considerations may provide signals that do 
not allow ready input timing requirements to be met. In this case, you can seg- 
ment coarse address space on the basis of a small number of address lines, 
where simpler gating allows signals to be generated more quickly. In either 
case, the signal that indicates that a particular area of memory is being 
addressed also normally initiates the ready or wait-state signal. 


When address space to be accessed has been established, a timing circuit is 
normally used to provide a ready indication to the processor at the appropriate 
point in the cycle to satisfy each device’s unique requirements. 


Finally, since indications of ready status from multiple devices are typically 
present, you should logically OR the signals by using a single gate to drive the 
RDY input. 
4.5.4 Ready Control Logic 


You can take one of two basic approaches to implement ready control logic, 
depending on the state of the ready input between accesses. If RDY is low be- 
tween accesses, the processor is always ready unless a wait state is required; 
if RDY is high between accesses, the processor will always enter a wait state 
unless a ready indication is generated. 


If RDY is low between accesses, control of devices that are zero-wait-state 
at full speed is straightforward; no action is necessary, because ready is al- 
ways active unless otherwise required. Devices requiring wait states, howev- 
er, must drive ready high fast enough to meet the input timing requirements. 
Then, after an appropriate delay, a ready indication must be generated. This 
can be difficult in many circumstances because wait-state devices are in- 
herently slow and often require complex select decoding. 


If RDY is high between accesses, zero-wait-state devices, which tend to be 
inherently fast, can usually respond immediately with a ready indication. Wait- 
state devices can simply delay their select signals appropriately to generate 
a ready. Typically, this approach results in the most efficient implementation 
of ready control logic. Figure 4—7 shows a circuit of this type, which can be 
used to generate 0, 1, or 2 wait states for multiple devices in a system. 


Figure 4—7. Logic for Generation of 0, 1, or 2 Wait States for Multiple Devices 
4.5.5 Example Circuit 


Figure 4—7 shows how a single, 7-ns 16R4 programmable logic device (PLD) 
can be used to generate 0, 1, and 2 wait states for multiple devices that are 
interfaced to a ’C4x. In this example, distinct address bits are used to select 
the different wait-state devices. Here, each of the three address lines input to 
the 16R4 corresponds to a different speed device. For a single 16R4 imple- 
mentation, up to nine different address bits can be used to select different 
speed devices. 


The single output, 4Q, of the PLD is connected directly to the RDY0 input of
the ’C4x to signal the completion of a bus access for external wait-state
generation. Because RDY0 is sampled on the falling edge of H1, the H3 output
clock is used as the PLD clock input.


Example 4-1 shows the ready logic equations for programming the 16R4
PLD. The PLD language used is ABEL. STRB0 is an input into the PLD that
indicates that a valid ’C4x bus cycle is occurring. Also, a delayed version
of STRB0 (synchronized with H1 going high) is provided as the strb_syn_ input
signal. This delayed signal is needed to avoid problems with a race condition
that may exist between STRB0 going low and H1 rising. RESET can be used
to bring the state machine back to the idle state.


Notice that the RDY0 output of the PLD is not registered. An asynchronous
RDY0 signal is necessary to generate a ready signal for zero-wait-state
devices. When a zero-wait-state device is selected (ahi1 high in
Example 4-1) and STRB0 is low, the PLD asserts RDY0 low within 7 ns. Hence,
RDY0 goes active fast enough to satisfy the 20-ns setup time of RDY0 low
before H1 low.


For generation of RDY0 for one and two wait states, the device select address
bits and strb_syn_ are delayed one and two cycles, respectively, by the PLD
before RDY0 is brought active low. The one H3-cycle delay, required for
one-wait-state device ready generation, corresponds to state wait_one in
Example 4-1, and the two H3-cycle delay required for two-wait-state devices
corresponds to states wait_twoa and wait_twob.


This 16R4 PLD-based design can be used to implement different numbers of
wait states for multiple devices. More devices can be selected with ’C4x
address lines, and a higher number of wait states can be produced with
additional PLD logic. Furthermore, this approach can be used in conjunction
with the ’C4x’s internal wait-state generator.
Example 4-1. PLD Equations for Ready Generation 


    module ready_generation
    title 'ready generation logic for 0, 1 and 2 wait state devices interfaced to TMS320C4x'

    c40u5 device 'P16R4';

    "inputs
    h3          Pin 1;

    "The following are TMS320C40 address bits used to
    "select the different speed devices. More can be used if
    "necessary. In this example, a zero wait state, a one wait
    "state, and a two wait state device are decoded with these
    "three address bits

    ahi1        Pin 2;   "when high selects zero wait state device
    ahi2        Pin 3;   "when high selects one wait state device
    ahi3        Pin 4;   "when high selects two wait state device
    strb0_      Pin 5;   "indicates valid TMS320C40 bus cycle
    reset_      Pin 6;   "reset signal from TMS320C40
    strb_syn_   Pin 7;   "strb0_ synchronized with H1 rising edge

    "output
    rdy0_       Pin 12;  "ready signal to TMS320C40
    one_wait    Pin 14;  "internal flip-flop signal for 1 wait state
                         "device ready signal generation
    two_waita   Pin 15;  "internal flip-flop signal for first of the two
                         "wait states for 2 wait state devices
    two_waitb   Pin 16;  "internal flip-flop signal for second
                         "of the two wait states for 2 wait
                         "state devices

    "name substitutions for test vectors
    C, H, L, X = .C., 1, 0, .X.;

    "state bits
    outstate  = [one_wait, two_waita, two_waitb];
    idle      = ^b111;
    wait_one  = ^b011;
    wait_twoa = ^b101;
    wait_twob = ^b110;

    state_diagram outstate
    state idle:
        if (reset_ & ahi2 & !strb_syn_) then wait_one
        else if (reset_ & ahi3 & !strb_syn_) then wait_twoa
        else idle;

    state wait_one:
        GOTO idle;

    state wait_twoa:
        if (reset_) then wait_twob
        else idle;

    state wait_twob:
        GOTO idle;

    equations
    !rdy0_ = reset_ & ((ahi1 & !strb0_) # !one_wait # !two_waitb);

    @page
    "Test 1st level global arbitration logic
    test_vectors
    ([h3, ahi1, ahi2, ahi3, strb0_, strb_syn_, reset_] -> [outstate, rdy0_])
    [C, X, X, X, X, X, L] -> [idle,      H];
    [C, L, H, L, L, L, H] -> [wait_one,  L];
    [C, X, X, X, X, X, L] -> [idle,      H];
    [C, L, L, H, L, L, H] -> [wait_twoa, H];
    [C, X, X, X, X, X, L] -> [idle,      H];
    [C, L, L, H, L, L, H] -> [wait_twoa, H];
    [C, L, L, H, L, L, H] -> [wait_twob, L];
    [C, X, X, X, X, X, L] -> [idle,      H];
    [L, H, L, L, L, L, H] -> [idle,      L];
    [C, H, L, L, L, L, H] -> [idle,      L];
    [L, L, L, L, L, L, H] -> [idle,      H];
    [C, L, H, L, L, L, H] -> [wait_one,  L];
    [C, X, X, X, X, X, H] -> [idle,      H];
    [C, L, L, H, L, L, H] -> [wait_twoa, H];
    [C, L, L, H, L, L, H] -> [wait_twob, L];
    [C, H, L, L, L, L, H] -> [idle,      L];
    [C, X, X, X, H, H, H] -> [idle,      H];
    [C, X, X, X, H, H, H] -> [idle,      H];

    end ready_generation
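
The behavior of the state machine in Example 4-1 can be checked in software
before a PLD is programmed. The following Python sketch is a behavioral
model written for illustration, not a translation produced by any ABEL tool.
The function names are hypothetical, signals ending in _n are active low
(matching the trailing underscore in the listing), and each call to
next_state represents one rising edge of the H3 clock.

```python
IDLE, WAIT_ONE, WAIT_TWOA, WAIT_TWOB = "idle", "wait_one", "wait_twoa", "wait_twob"


def next_state(state, ahi2, ahi3, strb_syn_n, reset_n):
    """State after one H3 clock edge, following the state diagram."""
    if state == IDLE:
        if reset_n and ahi2 and not strb_syn_n:
            return WAIT_ONE
        if reset_n and ahi3 and not strb_syn_n:
            return WAIT_TWOA
        return IDLE
    if state == WAIT_TWOA:
        return WAIT_TWOB if reset_n else IDLE
    return IDLE  # wait_one and wait_twob always return to idle


def rdy0_n(state, ahi1, strb0_n, reset_n):
    """Combinational ready output: 0 means ready is asserted."""
    asserted = bool(reset_n) and (
        (bool(ahi1) and not strb0_n) or state in (WAIT_ONE, WAIT_TWOB)
    )
    return 0 if asserted else 1


# Access to the two-wait-state device (ahi3 high, strobes active low).
state = IDLE
for _ in range(3):
    state = next_state(state, ahi2=0, ahi3=1, strb_syn_n=0, reset_n=1)
    print(state, rdy0_n(state, ahi1=0, strb0_n=0, reset_n=1))
```

The model asserts ready in wait_one and wait_twob because those are the
states in which the one_wait and two_waitb flip-flop outputs are low.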
4.5.6 Page Switching Techniques 


The ’C4x’s programmable page-switching feature can greatly ease system
design when large amounts of memory or slow external peripheral devices are
required. This feature provides a time period for disabling all device
selects. During the interval, slow devices are allowed time to turn off
before other devices have the opportunity to drive the data bus, thus
avoiding bus contention.


When page switching is enabled, any time a portion of the high-order address
lines changes, as defined by the contents of the STRB0 and STRB1 PAGESIZE
fields (in the global and local memory interface control registers), the
corresponding STRB and PAGE go high for one full H1 cycle. Provided that STRB
is included in chip-select decodes, this causes all devices selected by that
STRB to be disabled during this period. The next page of devices is not
enabled until STRB and PAGE go low again.


If the high-order address lines remain constant during a read cycle, the 
memory access time with page switching is the same as memory access time 
without page switching. In addition, page switching is not required during 
writes, because these write cycles exhibit an inherent one-half H1 cycle setup 
of address information before STRB goes low. Thus, when you use page 
switching for read/write devices, a minimum of half of one H1 cycle of address 
setup is provided for all accesses outside a page boundary. Therefore, large 
amounts of memory can be implemented without wait states or extra hardware 
required for isolation between pages. Also, note that access time for cycles 
during page switching is the same as that of cycles without page switching, 
and, accordingly, full-speed accesses may still be accomplished within each 
page. 


The circuit shown in Figure 4—8 illustrates page switching with the CY7B185 
15-ns 8K x 8 BiCMOS static RAM. This circuit implements 32K 32-bit words 
of memory with full-speed zero wait-state accesses within each page. 
Figure 4-8. Page Switching for the CY7B185 

[Circuit diagram label: Bank 0 (4 x CY7B185)]


A 5-ns 16L8 PLD decodes lines A15-A13. These lines along with STRB0
select each of the four pages in this circuit. With the PAGESIZE field of
STRB0 of the global memory interface control register set to 0Ch, the pages
are selected on even 8K-word boundaries, starting at location zero in
external memory space.
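
The condition that triggers a page switch can be expressed compactly. The
Python sketch below counts how many reads in a sequence cross a page
boundary and therefore incur the extra H1 cycle. It is a model for
illustration only: the function name is hypothetical, it assumes that a page
spans 2^(PAGESIZE + 1) words (consistent with 0Ch selecting 8K-word pages),
and it considers read cycles only.

```python
def count_page_switches(addresses, pagesize_field):
    """Count reads whose high-order address bits differ from the previous read."""
    shift = pagesize_field + 1      # low-order address bits within one page
    switches = 0
    previous_page = None
    for addr in addresses:
        page = addr >> shift
        if previous_page is not None and page != previous_page:
            switches += 1
        previous_page = page
    return switches


# PAGESIZE = 0Ch: 8K-word pages, so 01FFFh and 02000h lie in different pages.
reads = [0x0000, 0x0001, 0x1FFF, 0x2000, 0x2001, 0x0000]
print(count_page_switches(reads, 0x0C))
```

Keeping frequently accessed data within one page minimizes the count and so
preserves full-speed access.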


This circuit cannot be implemented without page switching, because the data 
output’s turn-on and turn-off delays cause bus conflicts, and full-speed 
accesses do not allow enough time for chip-select decoding for the four pages. 
Here, the propagation delay of the 16L8 is involved only during page switches, 
where there is sufficient time between cycles to allow new chip-selects to be 
decoded. 


The timing of this circuit for read operations with page switching is shown
in Figure 4-9. When a page switch occurs, the page address on address lines
A30-A13 is updated during the extra H1 cycle while STRB0 is high. Then,
after chip-select decodes have stabilized and the previously selected page
has disabled its outputs, STRB0 goes low for the next read cycle. Further
accesses occur at full speed with the normal bus timings, as long as another
page switch is not necessary. Write cycles do not require page switching,
because of the inherent address setup provided in their timings.


This timing is summarized in Table 4—2. 
Figure 4-9. Timing for Read Operations Using Bank Switching 


Table 4-2. Page Switching Interface Timing 

Time Interval   Event                               Time Period
-------------   ---------------------------------   -----------
t1              H1 falling to address/STRB valid    7 ns
t2              STRB to select delay                5 ns
t3              Memory disable from select          8 ns
t4              H1 falling to STRB                  7 ns
t5              STRB to select delay                5 ns
t6              Memory output enable delay          3 ns
4.6 Parallel Processing Through Shared Memory 


The ’C4x’s two memory interfaces allow flexibility to design shared-memory 
interfaces for parallel processing. Many processors can be linked together in 
a wide variety of network configurations through these ports. In this section, 
Figure 4—10 illustrates ’C4x shared-memory networks that you can use to fulfill 
many signal processing system needs. 


Figure 4-10. ’C4x Shared/Distributed-Memory Networks 
One of the most common multiprocessor configurations is the sharing of
memory by all processors in a system. Shared memory is typically
implemented by tying the processors’ data and address lines together.
However, the shared memory interface must guarantee that no more than one
processor is driving the shared bus at any one time; it must also allow all
processors sharing the bus to have a chance to access shared resources.


The ’C4x supports shared memory multiprocessing with its identical global-
and local-port interfaces. Both interfaces have four status output signals,
(L)STAT3-0, which identify what type of access is beginning on the bus. These
signals identify whether the ’C4x port is idle, a DMA read is occurring, a
STRB1 write is occurring, a LOCKed access to memory is pending, etc. The
signals can be interpreted by the interface to issue single access or locked
access bus requests to a shared bus arbiter.


The (L)CE, (L)AE, and (L)DE input signals support shared control, address,
and data lines. When the signals are disabled (high), they put the port’s
control signals, address lines, and data lines, respectively, in the
high-impedance state. These bus enable lines are asynchronous inputs to the
’C4x, which can quickly turn off bus drivers when another processor is
accessing a shared resource. However, these signals asynchronously turn off
the ’C4x’s local and global buses, without memory accesses being suspended.
To ensure that data written is seen externally and data read is valid, you
should use the external (L)RDY for wait-state generation in shared memory
designs. An (L)RDY signal should not be sent to the ’C4x until the processor
has regained access to the bus (CE, AE, DE enabled) and has had enough time
to complete its access. Hence, with bus enable and status signals, the ’C4x’s
flexible bus interfaces easily implement high-speed shared bus
configurations.
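
The rule for withholding (L)RDY can be stated as a simple predicate. The
Python sketch below is a behavioral illustration, not a circuit from this
guide: the function name and the cycle-counting arguments are hypothetical,
and signals ending in _n are active low.

```python
def lrdy_n(ce_n, ae_n, de_n, cycles_since_enabled, access_cycles):
    """Active-low ready for a shared bus: 0 = ready, 1 = not ready.

    Ready is returned only when the control, address, and data drivers are
    all enabled and the access has had access_cycles cycles to complete.
    """
    bus_owned = ce_n == 0 and ae_n == 0 and de_n == 0
    return 0 if bus_owned and cycles_since_enabled >= access_cycles else 1


print(lrdy_n(1, 0, 0, 5, 1))   # control lines still disabled: not ready
print(lrdy_n(0, 0, 0, 0, 1))   # bus just regained: access not yet complete
print(lrdy_n(0, 0, 0, 1, 1))   # bus owned and access time elapsed: ready
```

In hardware, the cycle count would typically come from the arbiter or from a
small counter started when the bus enables go active.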


4.6.2 Shared-Memory Interface Design Example 
For an example of a ’C4x shared-memory interface, see the TMS320C4x Parallel
Processing Development System Technical Reference (SPRU075). In the example
in that text, four ’C4x devices share SRAM with their global buses tied
together. A bus arbitrator implemented as a programmable logic device
provides a fair scheme for processor access to the shared bus. The design
uses high-speed parts but employs a fully asynchronous handshake protocol
that allows ’C4x devices of various speeds and also processors other than
’C4x devices to be added to this bus configuration.


The shared-memory interface in the PPDS works for ’C4x devices running at
a speed of up to 32 MHz. For higher speeds, the arbitrator incorrectly takes
away bus master privileges from a ’C4x between back-to-back reads to the
same page. (The page size is determined by the page size field in the global
bus control register; the default page size for the PPDS global memory is
64K.) If this occurs while two or more ’C4x devices are requesting the bus to
perform write cycles, random shared memory locations can be corrupted.


To fix this problem at higher speeds, the busenable_ signal of each ’C4x local
interface can be used in generating gmce0_ and gmce7_, to prevent these
signals from going low (active) when all of the processors' busenable_ signals
are high (inactive). The busenable_ signal is shown in the PLD equations in the
Global Bus Interface Logic section of the TMS320C4x Parallel Processing
Development System Technical Reference. The gmce0_ and gmce7_ signals are
shown in the Global Memory Control section of the same book.


Chapter 5 


Programming Tips 


Programming style is highly personal and reflects each individual’s prefer- 
ences and experiences. The purpose of this chapter is not to impose any par- 
ticular style. Instead, it emphasizes some of the features of the 'C4x that can 
help in producing faster and/or shorter programs. The tips in this chapter cover 
both C and assembly language programming. 
5.1 Hints for Optimizing C Code


The ’C4x’s large register file, software stack, and large memory space easily 
support the ’C4x C Compiler. The C compiler translates standard ANSI C pro- 
grams into assembly language source. It also increases the portability and de- 
creases the porting time of applications. 


The suggested methodology for developing your application follows five steps:

1) Write the application in C.
2) Debug the program.
3) Estimate whether the program runs in real time.
4) If the program does not run in real time:
   - Use the -o2 or -o3 option when compiling.
   - Use registers to pass parameters (-mr compiler option).
   - Use inlining (-x compiler option).
   - Remove the -g option when compiling.
   - Follow some of the efficient code generation tips listed below.
5) Identify places where most of the execution time is spent and optimize
   these areas by writing assembly language routines that implement the
   functions.


The efficiency of the code generated by the floating point compiler depends 
to a large extent on how well you take advantage of the compiler strengths de- 
scribed above when writing your C code. There are specific constructs that can 
vastly improve the compiler’s effectiveness: 


- Use register variables for often-used variables. This is particularly true
  for pointer variables. Example 5-1 shows a code fragment that exchanges
  one object in memory with another.


Example 5-1. Exchanging Objects in Memory


do
{
    temp  = *++src;
    *src  = *++dest;
    *dest = temp;
}
while (--n);
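
As a self-contained illustration of the same idea, the sketch below wraps the exchange loop in a function. The function name exchange_objects, the int element type, and the use of post-increment (so that the pointers start at the first element rather than one element before it) are assumptions made for this example; they are not taken from the compiler documentation.

```c
#include <assert.h>

/* Exchange n objects between two arrays (hypothetical helper).
   Post-increment is used so that neither pointer ever has to point
   one element before the start of its array. */
void exchange_objects(int *src, int *dest, unsigned n)
{
    register int *s = src;    /* register pointers, as the tip suggests */
    register int *d = dest;
    register int temp;

    assert(n > 0);            /* the do/while form needs at least one pass */
    do
    {
        temp = *s;
        *s++ = *d;
        *d++ = temp;
    }
    while (--n);
}
```

Calling exchange_objects(a, b, 3) on two three-element arrays swaps their contents element by element.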


- Pre-compute subexpressions, especially array references in loops. Assign
  commonly used expressions to register variables where possible.

- Use *++ to step through arrays, rather than using an index to recalculate
  the address each time through a loop.


As an example of the previous 2 points, consider the loops in Example 5-2: 


Example 5-2. Optimizing a Loop


/* loop 1 */
main()
{
    float a[10], b[10];
    int i;
    for (i = 0; i < 10; ++i)
        a[i] = (a[i] * 20) + b[i];
}

/* loop 2 */
main()
{
    float a[10], b[10];
    int i;
    register float *p = a, *q = b;
    for (i = 0; i < 10; ++i)
        *p++ = (*p * 20) + *q++;
}


Loop 1 executes in 19 cycles. Loop 2, which is the equivalent of loop 1, 
executes in 12 cycles. 
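
Loop 2 both increments p (in *p++) and reads through p (in *p) within one expression, and ANSI C does not define the order of those two accesses. The sketch below keeps the pointer-stepping technique but advances p in the for header instead. The function name scale_and_add and its parameters are hypothetical, and no cycle count is claimed for this variant.

```c
#include <assert.h>

/* Compute a[i] = a[i] * 20 + b[i] for n elements by stepping through
   both arrays with register pointers (hypothetical helper). */
void scale_and_add(float *a, const float *b, int n)
{
    register float *p = a;
    register const float *q = b;
    int i;

    assert(n >= 0);
    for (i = 0; i < n; ++i, ++p)
        *p = (*p * 20) + *q++;    /* p is advanced in the for header */
}
```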


- Use structure assignments to copy blocks of data. The compiler generates
  very efficient code for structure assignments, so nest objects within
  structures and use simple assignments to copy them.


- Avoid large local frames and declare the most often used local variables
  first. The compiler uses indirect addressing with an 8-bit offset to
  access local data. To access objects on the local frame with offsets greater
  than 255, the compiler must first load the offset into an index register. This
  costs one extra instruction and incurs two cycles of pipeline delay.


- Avoid the large model. The large model is inefficient because the compiler
  reloads the data-page pointer (DP) before each access to a global or
  static variable. If you have large array objects, use malloc() to dynamically
  allocate them and access them via pointers rather than declaring them
  globally. Example 5-3 illustrates two methods for allocating large array
  objects:
Example 5-3. Allocating Large Array Objects


/* Bad Method */
int a[100000];                                   /* BAD */

/* Good Method */
int *a = (int *)malloc(100000 * sizeof(int));    /* GOOD */

a[i] = 10;
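
A slightly fuller sketch of the dynamic-allocation method follows. The helper name alloc_big_array is hypothetical; the point is only that the result of malloc() should be checked before the array is used, because a heap that is too small returns a null pointer.

```c
#include <assert.h>
#include <stdlib.h>

/* Allocate a large int array on the heap instead of declaring it
   globally (hypothetical helper). Returns NULL if the heap cannot
   satisfy the request, so the caller must check the result. */
int *alloc_big_array(size_t count)
{
    int *a;

    assert(count > 0);
    a = (int *)malloc(count * sizeof(int));
    return a;
}
```

A caller would test the returned pointer, access the data as a[i], and release it with free(a) when it is no longer needed.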
5.2 Hints for Optimizing Assembly-Language Code


Each program has particular requirements. Not all possible optimizations 
make sense in every case. The suggestions presented in this section can be 
used as a checklist of available software tools. 


- Use delayed branches. Delayed branches execute in a single cycle; regular
  branches execute in four. The three instructions that follow the
  delayed branch are executed whether the branch is taken or not. If fewer
  than three useful instructions are available, use the delayed branch anyway
  and pad with NOPs; machine cycles are still saved.


- Use delayed subroutine call and return. Regular subroutine CALL and
  RETS execute in four cycles. You can implement a delayed subroutine call
  by using the link-and-jump (LAJ) instruction, and a delayed return by using
  a delayed branch in R11 register mode (BUD R11). Both LAJ and BUD execute
  in a single cycle. Guidelines for using the LAJ instruction are the same as
  for delayed branches.


- Use the repeat single/block construct. This method produces loops
  with no overhead. Nesting such constructs will not normally increase
  efficiency, so try to use the feature on the most often performed loop.
  RPTBD is a single-cycle instruction; RPTS and RPTB are four-cycle
  instructions. RPTBD and delayed branches are used in similar ways.
  Note that RPTS is not interruptible, and the repeated instruction is not
  refetched for execution. This frees the buses for operands.


- Use parallel instructions. You can have a multiply in parallel with an add
  (or subtract), and stores in parallel with any multiply or ALU operation. This
  increases the number of operations executed in a single cycle. For
  maximum efficiency, observe the addressing modes used in parallel
  instructions and arrange the data appropriately. You can also have loads in
  parallel with any multiply or add (or subtract). The result of a multiply by
  one or an add of zero is the same as a load. Therefore, to implement
  parallel instructions with a data load, you can substitute a multiply or an
  add instruction, with one extra register containing a one or zero, in place
  of the load instruction.


- Maximize the use of registers. The registers are an efficient way to
  access scratch-pad memory. Extensive use of the register file facilitates
  the use of parallel instructions and helps avoid pipeline conflicts when you
  use register addressing.


- Use the cache. The cache speeds instruction fetches and enables
  single-cycle access, even with slow external memory. The cache is
  transparent to the user, so make sure that it is enabled.
- Use internal memory instead of external memory. The internal
  memory (2K x 32 bits RAM and 4K x 32 bits ROM) is considerably faster
  to access than external memory. In a single cycle, two operands can be
  brought from internal memory. You can maximize performance if you use
  the DMA coprocessor in parallel with the CPU to transfer data you want
  to operate on to internal memory.


- Avoid pipeline conflicts. For time-critical operations, make sure that
  cycles are not missed because of pipeline conflicts. If there is no problem
  with program speed, ignore this suggestion.


- Plan your linker command file in advance. Memory allocation for code
  and data sections can have a big impact on your algorithm performance.
  One of the ’C4x’s strengths is its sustained bandwidth, achieved by having
  two external buses. By carefully dividing data and program between the
  two buses, you can minimize pipeline conflicts. You need to apply the
  same concept to minimize DMA/CPU access conflicts.


The above checklist is not exhaustive, and it does not address some features 
in detail. To learn how to exploit the full power of the ’C4x, carefully study its 
architecture, hardware configuration, and instruction set, which are all
described in the TMS320C4x User’s Guide (SPRU063).


Chapter 6 


Applications-Oriented Operations 


The ’C4x architecture and instruction set features facilitate the solution of nu- 
merically intensive problems. This chapter presents examples of applications 
that use these features, such as companding, filtering, matrix arithmetic, and 
fast Fourier transforms (FFT). 
6.1 Companding


In telecommunications, one of the primary concerns is to conserve the channel 
bandwidth and, at the same time, to preserve high speech quality. This is 
achieved by quantizing the speech samples logarithmically. It has been 
demonstrated that an 8-bit logarithmic quantizer produces speech quality 
equivalent to that of a 13-bit uniform quantizer. The logarithmic quantization 
is achieved by companding (COMpress/exPANDing). Two international
standards have been established for companding: µ-law (used in the
United States and Japan) and A-law (used in Europe). Detailed
descriptions of µ-law and A-law companding are presented in an application
report on companding routines included in the book Digital Signal Processing
Applications with the TMS320 Family (literature number SPRA012A).


During transmission, logarithmically compressed data in sign-magnitude form
is transmitted along the communications channel. If any processing is
necessary, this data should be expanded to a 14-bit (for µ-law) or 13-bit (for
A-law) linear format. This operation occurs when data is received at the digital
signal processor. After processing, and in order to continue transmission, the
result is compressed back to 8-bit format and transmitted through the channel.


Example 6-1 and Example 6-2 show µ-law compression and expansion
(that is, linear-to-µ-law and µ-law-to-linear conversion), while Example 6-3
and Example 6-4 show A-law compression and expansion. For expansion,
using a look-up table is an alternative approach. It trades memory space for
speed of execution. Because the compressed data is 8 bits long, a table with
256 entries can be constructed to contain the expanded data. If the
compressed data is stored in register AR0, the following two instructions
put the expanded data in register R0:


        ADDI   @TABL,AR0      ; @TABL = BASE ADDRESS OF TABLE
        LDI    *AR0,R0        ; PUT EXPANDED NUMBER IN R0


The same look-up table approach could be used for compression, but the
required table length would then be 16,384 words for µ-law or 8,192 words for
A-law. If this memory size is not acceptable, use the subroutines
presented in Example 6-1 or Example 6-3.
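
The 256-entry expansion table could be generated off-line with a few lines of C. The sketch below is only an illustration: the function names mulaw_expand and mulaw_build_table are hypothetical, and the arithmetic simply mirrors the bit steps of the µ-law expansion routine (complement the code word, isolate the quantization bin and segment, add the bias of 33, shift, remove the bias, then apply the sign).

```c
#include <assert.h>

/* Expand one 8-bit mu-law code word to a 14-bit linear value.
   Hypothetical helper that mirrors the bit steps of the mu-law
   expansion routine. */
int mulaw_expand(unsigned code)
{
    unsigned u   = ~code & 0xFFu;               /* complement bits        */
    int      q   = (int)(u & 0x0Fu);            /* quantization bin       */
    int      seg = (int)((u >> 4) & 0x07u);     /* segment code           */
    int      mag = (((q << 1) + 33) << seg) - 33; /* bias, shift, unbias  */

    assert(code <= 0xFFu);
    return (u & 0x80u) ? -mag : mag;            /* apply sign             */
}

/* Fill a 256-entry table so that table[code] is the expanded value. */
void mulaw_build_table(int table[256])
{
    unsigned code;

    for (code = 0; code < 256; ++code)
        table[code] = mulaw_expand(code);
}
```

With such a table in memory, expansion reduces to a single indexed load, which is what the two-instruction sequence above performs.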
Example 6-1. µ-Law Compression


*
*  TITLE U-LAW COMPRESSION
*
*  SUBROUTINE MUCMPR
*
*  TYPICAL CALLING SEQUENCE:
*       LAJU   MUCMPR
*       LDI    v,R0
*       NOP    <---- can be other non-pipeline-break
*       NOP    <---- instructions
*
*  ARGUMENT ASSIGNMENTS:
*
*  ARGUMENT | FUNCTION
*  ---------+---------------------------
*  R0       | v = NUMBER TO BE CONVERTED
*
*  REGISTERS USED AS INPUT: R0
*  REGISTERS MODIFIED: R0, R1
*  REGISTER CONTAINING RESULT: R0
*
*  BENCHMARKS: CYCLES: 14 (not including the BUD instruction)
*              WORDS: 15 (not including the BUD instruction)
*
        .global MUCMPR
*
MUCMPR  LSH3   -6,R0,R1       ;Save sign of number
        ABSI   R0,R0
        CMPI   1FDEH,R0       ;If R0>0x1FDE,
        LDIGT  1FDEH,R0       ;saturate the result
        ADDI   33,R0          ;Add bias
        FLOAT  R0             ;Normalize: (seg+5)0WXYZx...x
        MPYF   0.03125,R0     ;Adjust segment number by 2**(-5)
        LSH    1,R0           ;(seg)WXYZx...x
        PUSHF  R0
        POP    R0             ;Treat number as integer
        LSH    -20,R0         ;Right-justify
        BUD    R11            ;Delayed return
        AND    080H,R1        ;Set sign bit
        ADDI   R1,R0          ;R0 = compressed number
        NOT    R0             ;Reverse all bits for transmission
Example 6-2. µ-Law Expansion


*
*  TITLE U-LAW EXPANSION
*
*  SUBROUTINE MUXPND
*
*  TYPICAL CALLING SEQUENCE:
*       LAJU   MUXPND
*       LDI    v,R0
*       NOP    <---- can be other non-pipeline-break
*       NOP    <---- instructions
*
*  ARGUMENT ASSIGNMENTS:
*
*  ARGUMENT | FUNCTION
*  ---------+---------------------------
*  R0       | v = NUMBER TO BE CONVERTED
*
*  REGISTERS USED AS INPUT: R0
*  REGISTERS MODIFIED: R0, R1, R2
*  REGISTER CONTAINING RESULT: R0
*
*  BENCHMARKS: CYCLES: 11/10 (worst/best, not including subroutine overhead)
*              WORDS: 11 (not including subroutine overhead)
*
        .global MUXPND
*
MUXPND  NOT    R0,R0          ;Complement bits
        AND3   0FH,R0,R1      ;Isolate quantization bin
        LSH    1,R1
        ADDI   33,R1          ;Add bias to introduce 1xxxx1
        LSH3   -4,R0,R0       ;Isolate segment code
        TSTB   08H,R0         ;Test sign
        BZD    R11            ;If positive, delayed return
        AND    7,R0
        LSH3   R0,R1,R0       ;Shift and put result in R0
        SUBI   33,R0          ;Subtract bias
        BUD    R11            ;Delayed return
        NEGI   R0             ;Negate if a negative number
        NOP
        NOP
Example 6-3. A-Law Compression


*
*  TITLE A-LAW COMPRESSION
*
*  SUBROUTINE ACMPR
*
*  TYPICAL CALLING SEQUENCE:
*       LAJ    ACMPR
*       LDI    v,R0
*       NOP    <---- can be other non-pipeline-break
*       NOP    <---- instructions
*
*  ARGUMENT ASSIGNMENTS:
*
*  ARGUMENT | FUNCTION
*  ---------+---------------------------
*  R0       | v = NUMBER TO BE CONVERTED
*
*  REGISTERS USED AS INPUT: R0
*  REGISTERS MODIFIED: R0, R1
*  REGISTER CONTAINING RESULT: R0
*
*  BENCHMARKS: CYCLES: 16/10 (worst/best, not including subroutine overhead)
*              WORDS: 16 (not including subroutine overhead)
*
        .global ACMPR
*
ACMPR   LSH3   -5,R0,R1       ;Save sign of number
        ABSI   R0,R0
        CMPI   1FH,R0         ;If R0<0x20,
        BLED   END            ;do linear coding
        CMPI   0FFFH,R0       ;If R0>0xFFF,
        LDIGT  0FFFH,R0       ;saturate the result
        LSH    -1,R0          ;Eliminate rightmost bit
        FLOAT  R0             ;Normalize: (seg+3)0WXYZx...x
        MPYF   0.125,R0       ;Adjust segment number by 2**(-3)
        LSH    1,R0           ;(seg)WXYZx...x
        PUSHF  R0
        POP    R0             ;Treat number as integer
        LSH    -20,R0         ;Right-justify
END     BUD    R11            ;Delayed return
        AND    080H,R1        ;Set sign bit
        ADDI   R1,R0          ;R0 = compressed number
        XOR    0D5H,R0        ;Invert even bits for transmission
Example 6-4. A-Law Expansion


*
*  TITLE A-LAW EXPANSION
*
*  SUBROUTINE AXPND
*
*  TYPICAL CALLING SEQUENCE:
*       LAJU   AXPND
*       LDI    v,R0
*       NOP    <---- can be other non-pipeline-break
*       NOP    <---- instructions
*
*  ARGUMENT ASSIGNMENTS:
*
*  ARGUMENT | FUNCTION
*  ---------+---------------------------
*  R0       | v = NUMBER TO BE CONVERTED
*
*  REGISTERS USED AS INPUT: R0
*  REGISTERS MODIFIED: R0, R1, R2
*  REGISTER CONTAINING RESULT: R0
*
*  BENCHMARKS: CYCLES: 15/13 (worst/best, not including subroutine overhead)
*              WORDS: 15 (not including subroutine overhead)
*
        .global AXPND
*
AXPND   XOR    0D5H,R0,R2     ;Invert even bits
        ASH3   -4,R2,R0       ;Store for bit sign
        AND    7,R0           ;Isolate segment code
        BZD    SKIP1
        AND3   0FH,R2,R1      ;Isolate quantization bin
        LSH    1,R1
        ADDI   1,R1           ;Create 0xxxx1
        ADDI   32,R1          ;Or 1xxxx1
        SUBI   1,R0
SKIP1   LSH3   R0,R1,R0       ;Shift and put result in R0
        TSTB   80H,R2         ;Test sign bit
        BZAT   R11            ;If positive, delayed return and
                              ;annul next three instructions
        NEGI   R0             ;Negate if a negative number
        NOP
        NOP
        BU     R11            ;Return
6.2 FIR, IIR, and Adaptive Filters


6.2.1 FIR Filters


Digital filters are a common requirement for digital signal processing systems. 
There are two types of digital filters: finite impulse response (FIR) and infinite 
impulse response (IIR). Each of these types can have either fixed or adaptable 
coefficients. In this section, the fixed-coefficient filters are presented first, and 
then the adaptive filters are discussed. 


If the FIR filter has an impulse response h[0], h[1], ..., h[N-1], and x[n]
represents the input of the filter at time n, the output y[n] at time n is given
by this equation:


y[n] = h[0] x[n] + h[1] x[n-1] + ... + h[N-1] x[n-(N-1)]


Two features of the ’C4x that facilitate the implementation of the FIR filters are 
parallel multiply/add operations and circular addressing. The first permits the 
performance of a multiplication and an addition in a single machine cycle, while 
the second makes a finite buffer of length N sufficient for the data x. 
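
To make the role of the circular buffer concrete, here is a reference sketch of the same computation in C. It is not the ’C4x routine: the type fir_state and the function fir_step are hypothetical names, and the wrap-around that circular addressing performs in hardware is written out explicitly as index arithmetic.

```c
#include <assert.h>

/* State for an N-tap FIR filter with a circular input buffer
   (hypothetical type, for illustration only). */
typedef struct {
    const float *h;   /* N coefficients h[0..N-1]              */
    float *x;         /* N-sample circular buffer              */
    int n_taps;       /* N                                     */
    int newest;       /* index of the most recent sample in x  */
} fir_state;

/* Accept one input sample and return y[n] = sum of h[k] * x[n-k]. */
float fir_step(fir_state *s, float sample)
{
    float acc = 0.0f;
    int k, idx;

    assert(s->n_taps > 0);

    /* The new sample overwrites the oldest one. */
    s->newest = (s->newest + 1) % s->n_taps;
    s->x[s->newest] = sample;

    idx = s->newest;
    for (k = 0; k < s->n_taps; ++k) {
        acc += s->h[k] * s->x[idx];                  /* h[k] * x[n-k]  */
        idx = (idx == 0) ? s->n_taps - 1 : idx - 1;  /* wrap backwards */
    }
    return acc;
}
```

Feeding a unit impulse into a filter with h = {1, 2, 3} returns 1, 2, 3 on successive calls, which is the impulse response itself.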


Figure 6—1 shows the arrangement of the memory locations to implement cir- 
cular addressing, while Example 6—5 presents the ’C4x assembly code for an 
FIR filter. 


Figure 6-1. Data Memory Organization for an FIR Filter

   low address     h(N-1)    oldest input    x(n-(N-1))    x(n)
                   ...                       ...           ...
                   h(1)                      x(n-1)        x(n-2)
   high address    h(0)      newest input    x(n)          x(n-1)


To set up circular addressing, initialize the block-size register BK to block 
length N. Also, the locations for signal x should start from a memory location 
whose address is a multiple of the smallest power of 2 that is greater than N. 
For instance, if N = 24, the first address for x should be a multiple of 32 (the 
lower 5 bits of the beginning address should be zero). For details, see
Circular Addressing in the TMS320C4x User’s Guide.
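
The alignment rule can be checked with a few lines of C when planning memory. This is a sketch with hypothetical function names; it follows the rule exactly as worded above, that is, the smallest power of 2 strictly greater than the block length.

```c
#include <assert.h>

/* Smallest power of 2 that is strictly greater than n (n >= 1). */
unsigned long circ_alignment(unsigned long n)
{
    unsigned long p = 1;

    assert(n >= 1);
    while (p <= n)
        p <<= 1;
    return p;
}

/* Nonzero if addr is a valid start address for a circular buffer
   of length n under the rule above. */
int circ_start_ok(unsigned long addr, unsigned long n)
{
    return (addr & (circ_alignment(n) - 1)) == 0;
}
```

For N = 24 the required alignment is 32, so a start address of 0x1000 is acceptable and 0x1010 is not.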


In Example 6—5, the pointer to the input sequence x is incremented and as- 
sumed to be moving from an older input to a newer input. At the end of the sub- 
routine, AR1 will point to the position for the next input sample. 
Example 6-5. FIR Filter


*
*  TITLE FIR FILTER
*
*  SUBROUTINE FIR
*
*  EQUATION: y(n) = h(0) * x(n) + h(1) * x(n-1) +
*                   ... + h(N-1) * x(n-(N-1))
*
*  TYPICAL CALLING SEQUENCE:
*
*       LOAD   AR0
*       LAJU   FIR
*       LOAD   AR1
*       LOAD   RC
*       LOAD   BK
*
*  ARGUMENT ASSIGNMENTS:
*
*  ARGUMENT | FUNCTION
*  ---------+---------------------------
*  AR0      | ADDRESS OF h(N-1)
*  AR1      | ADDRESS OF x(N-1)
*  RC       | LENGTH OF FILTER - 2 (N-2)
*  BK       | LENGTH OF FILTER (N)
*
*  REGISTERS USED AS INPUT: AR0, AR1, RC, BK
*  REGISTERS MODIFIED: R0, R2, AR0, AR1, RC
*  REGISTER CONTAINING RESULT: R0
*
*  BENCHMARKS: CYCLES: 3 + N (not including subroutine overhead)
*              WORDS: 6 (not including subroutine overhead)
*
        .global FIR
*
FIR     RPTBD  CONV                       ;Set up the repeat cycle
*  Initialize R0:
        MPYF3  *AR0++(1),*AR1++(1)%,R0    ;h(N-1) * x(n-(N-1)) -> R0
        LDF    0.0,R2                     ;Initialize R2
        NOP
*
*  FILTER (1 <= i < N)
*
CONV    MPYF3  *AR0++(1),*AR1++(1)%,R0    ;h(N-1-i) * x(n-(N-1-i)) -> R0
||      ADDF3  R0,R2,R2                   ;Multiply and add operation
*
        BUD    R11                        ;Delayed return
        ADDF   R0,R2,R0                   ;Add last product
        NOP
        NOP
*
*  end
*
        .end
6.2.2 IIR Filters


The transfer function of the IIR filters has both poles and zeros. Its output de- 
pends on both the input and the past output. As a rule, the filters need less 
computation than an FIR with similar frequency response, but the filters have 
the drawback of being sensitive to coefficient quantization. Most often, the IIR 
filters are implemented as a cascade of second-order sections called biquads. 
Example 6-6 and Example 6—7 show the implementation for one biquad and 
for any number of biquads, respectively. 


The equation for a single biquad is:


y[n] = a1 y[n-1] + a2 y[n-2] + b0 x[n] + b1 x[n-1] + b2 x[n-2]


However, the following two equations are more convenient and have smaller
storage requirements:


d[n] = a2 d[n-2] + a1 d[n-1] + x[n]
y[n] = b2 d[n-2] + b1 d[n-1] + b0 d[n]
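
The two-equation form needs only the two delay nodes d[n-1] and d[n-2] as state. The C sketch below illustrates this; the type biquad and the function biquad_step are hypothetical names chosen for this example, and the code is a reference model, not a translation of the assembly routine.

```c
#include <assert.h>

/* Coefficients and delay nodes of one second-order section
   (hypothetical type, for illustration only). */
typedef struct {
    float a1, a2;       /* feedback coefficients          */
    float b0, b1, b2;   /* feedforward coefficients       */
    float d1, d2;       /* delay nodes d[n-1] and d[n-2]  */
} biquad;

/* Process one input sample x[n] and return y[n]. */
float biquad_step(biquad *s, float x)
{
    float d, y;

    assert(s != 0);
    d = s->a2 * s->d2 + s->a1 * s->d1 + x;            /* d[n] */
    y = s->b2 * s->d2 + s->b1 * s->d1 + s->b0 * d;    /* y[n] */
    s->d2 = s->d1;                                    /* age the delay nodes */
    s->d1 = d;
    return y;
}
```

With a1 = 0.5, b0 = 1, and all other coefficients zero, a unit impulse produces 1, 0.5, 0.25, and so on.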


Figure 6—2 shows the memory organization for this two-equation approach to 
the implementation of a single biquad on the ’C4x. 


Figure 6—2. Data Memory Organization for a Single Biquad 
As in the case of FIR filters, the address for the start of the values d must be
a multiple of 4; that is, the last two bits of the beginning address must be zero. 
The block-size register BK must be initialized to 3. 
Example 6-6. IIR Filter (One Biquad)


*
*  TITLE IIR FILTER
*
*  SUBROUTINE IIR1
*
*  IIR1 == IIR FILTER (ONE BIQUAD)
*
*  EQUATIONS: d(n) = a2 * d(n-2) + a1 * d(n-1) + x(n)
*             y(n) = b2 * d(n-2) + b1 * d(n-1) + b0 * d(n)
*
*  OR   y(n) = a1*y(n-1) + a2*y(n-2) + b0*x(n) + b1*x(n-1)
*              + b2*x(n-2)
*
*  TYPICAL CALLING SEQUENCE:
*
*       load   R2
*       LAJU   IIR1
*       load   AR0
*       load   AR1
*       load   BK
*
*  ARGUMENT ASSIGNMENTS:
*
*  ARGUMENT | FUNCTION
*  ---------+----------------------------------------
*  R2       | INPUT SAMPLE x(n)
*  AR0      | ADDRESS OF FILTER COEFFICIENTS (a2)
*  AR1      | ADDRESS OF DELAY NODE VALUES (d(n-2))
*  BK       | BK = 3
*
*  REGISTERS USED AS INPUT: R2, AR0, AR1, BK
*  REGISTERS MODIFIED: R0, R1, R2, AR0, AR1
*  REGISTER CONTAINING RESULT: R0
*
*  BENCHMARKS: CYCLES: 7 (not including subroutine overhead)
*              WORDS: 7 (not including subroutine overhead)
*
        .global IIR1
*
IIR1    MPYF3  *AR0,*AR1,R0               ;a2 * d(n-2) -> R0
        MPYF3  *++AR0(1),*AR1--(1)%,R1    ;b2 * d(n-2) -> R1
*
        MPYF3  *++AR0(1),*AR1,R0          ;a1 * d(n-1) -> R0
||      ADDF   R0,R2,R2                   ;a2*d(n-2)+x(n) -> R2
*
        MPYF3  *++AR0(1),*AR1--(1)%,R0    ;b1 * d(n-1) -> R0
||      ADDF3  R0,R2,R2                   ;a1*d(n-1)+a2*d(n-2)+x(n) -> R2
*
        BUD    R11                        ;Delayed return
*
        MPYF3  *++AR0(1),R2,R2            ;b0 * d(n) -> R2
||      STF    R2,*AR1++(1)%              ;Store d(n) and point to d(n-1)
*
        ADDF   R0,R2                      ;b1*d(n-1)+b0*d(n) -> R2
        ADDF   R1,R2,R0                   ;b2*d(n-2)+b1*d(n-1)+b0*d(n) -> R0
*
*  end
*
        .end


Generally, the IIR filter contains N > 1 biquads. The equations for its
implementation are given by the following pseudo-C language code:


y[0,n] = x[n]
for (i = 0; i < N; i++) {
    d[i,n] = a2[i] d[i,n-2] + a1[i] d[i,n-1] + y[i-1,n]
    y[i,n] = b2[i] d[i,n-2] + b1[i] d[i,n-1] + b0[i] d[i,n]
}
y[n] = y[N-1,n]


Figure 6—3 shows the memory organization, and Example 6—7 shows the cor- 
responding 'C4x assembly-language code. 
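
As a compilable counterpart to the pseudo-C above, the following sketch runs one input sample through a cascade of sections. The names iir_section and iir_cascade_step are hypothetical, and the sketch assumes that the input to the first section is x[n] and that each later section takes the output of the section before it.

```c
#include <assert.h>

/* One second-order section of the cascade (hypothetical type). */
typedef struct {
    float a1, a2, b0, b1, b2;   /* coefficients of this section   */
    float d1, d2;               /* delay nodes d[i,n-1], d[i,n-2] */
} iir_section;

/* Pass one sample through n_sections biquads in series. */
float iir_cascade_step(iir_section *sec, int n_sections, float x)
{
    float y = x;                /* input to the first section */
    int i;

    assert(n_sections > 0);
    for (i = 0; i < n_sections; ++i) {
        iir_section *s = &sec[i];
        float d = s->a2 * s->d2 + s->a1 * s->d1 + y;    /* d[i,n] */

        y = s->b2 * s->d2 + s->b1 * s->d1 + s->b0 * d;  /* y[i,n] */
        s->d2 = s->d1;
        s->d1 = d;
    }
    return y;                   /* output of the last section */
}
```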


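Figure 6-3. Data Memory Organization for N Biquads

[Figure: three columns (filter coefficients, initial delay node values, final
delay node values) running from low address to high address. The coefficient
column ends with b0(N-1) at the high address; the delay node values of each
biquad, for example d(N-1,n), d(N-1,n-1), and d(N-1,n-2), form a circular queue.]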


The block size register BK should be initialized to 3, and each set of d values 
(i.e., d[i,n], i = 0...N—-1) should begin at an address that is a multiple of 4 (the 
last two bits zero), as stated in the case of a single biquad. 
Example 6-7. IIR Filter (N > 1 Biquads)


*
*  TITLE IIR FILTER (N > 1 BIQUADS)
*
*  SUBROUTINE IIR2
*
*  EQUATIONS: y(0,n) = x(n)
*
*             FOR (i = 0; i < N; i++)
*             {
*               d(i,n) = a2(i) * d(i,n-2) + a1(i) * d(i,n-1) + y(i-1,n)
*               y(i,n) = b2(i) * d(i,n-2) + b1(i) * d(i,n-1) + b0(i) * d(i,n)
*             }
*             y(n) = y(N-1,n)
*
*  TYPICAL CALLING SEQUENCE:
*
*       load   R2
*       load   AR0
*       load   AR1
*       load   IR0
*       LAJU   IIR2
*       load   IR1
*       load   BK
*       load   RC
*
*  ARGUMENT ASSIGNMENTS:
*
*  ARGUMENT | FUNCTION
*  ---------+---------------------------------------------
*  R2       | INPUT SAMPLE x(n)
*  AR0      | ADDRESS OF FILTER COEFFICIENTS (a2(0))
*  AR1      | ADDRESS OF DELAY NODE VALUES (d(0,n-2))
*  BK       | BK = 3
*  IR0      | IR0 = 4
*  IR1      | IR1 = 4*N-4
*  RC       | NUMBER OF BIQUADS (N) - 2
*
*  REGISTERS USED AS INPUT: R2, AR0, AR1, IR0, IR1, BK, RC
*  REGISTERS MODIFIED: R0, R1, R2, AR0, AR1, RC
*  REGISTER CONTAINING RESULT: R0
*
*  BENCHMARKS: CYCLES: 2 + 6N (not including subroutine overhead)
*              WORDS: 15 (not including subroutine overhead)
*
        .global IIR2
*
IIR2    MPYF3  *AR0,*AR1,R0                ;a2(0) * d(0,n-2) -> R0
        MPYF3  *++AR0(1),*AR1--(1)%,R1     ;b2(0) * d(0,n-2) -> R1
*
        RPTBD  LOOP                        ;Set loop for 1 <= i < n
*
        MPYF3  *++AR0(1),*AR1,R0           ;a1(0) * d(0,n-1) -> R0
||      ADDF   R0,R2,R2                    ;First sum term of d(0,n)
*
        MPYF3  *++AR0(1),*AR1--(1)%,R0     ;b1(0) * d(0,n-1) -> R0
||      ADDF3  R0,R2,R2                    ;Second sum term of d(0,n)
*
        MPYF3  *++AR0(1),R2,R2             ;b0(0) * d(0,n) -> R2
||      STF    R2,*AR1--(1)%               ;Store d(0,n); point to d(0,n-2)
*
*  LOOP STARTS HERE
*
        MPYF3  *++AR0(1),*++AR1(IR0),R0    ;a2(i) * d(i,n-2) -> R0
||      ADDF3  R0,R2,R2                    ;First sum term of y(i-1,n)
*                                          ;Pipeline hit on previous instruction
*
        MPYF3  *++AR0(1),*AR1--(1)%,R1     ;b2(i) * d(i,n-2) -> R1
||      ADDF3  R1,R2,R2                    ;Second sum term of y(i-1,n)
*
        MPYF3  *++AR0(1),*AR1,R0           ;a1(i) * d(i,n-1) -> R0
||      ADDF3  R0,R2,R2                    ;First sum term of d(i,n)
*
        MPYF3  *++AR0(1),*AR1--(1)%,R0     ;b1(i) * d(i,n-1) -> R0
||      ADDF3  R0,R2,R2                    ;Second sum term of d(i,n)
*
LOOP    MPYF3  *++AR0(1),R2,R2             ;b0(i) * d(i,n) -> R2
||      STF    R2,*AR1--(1)%               ;Store d(i,n); point to d(i,n-2)
*
*  FINAL SUMMATION
*
        ADDF   R0,R2                       ;First sum term of y(n-1,n)
        BUD    R11                         ;Delayed return
        ADDF3  R1,R2,R0                    ;Second sum term of y(n-1,n)
        NOP    *AR1--(IR1)                 ;Return to first biquad
        NOP    *AR1--(1)%                  ;Point to d(0,n-1)
*
*  end
*
        .end


6.2.3 Adaptive Filters (LMS Algorithm) 


In some applications in digital signal processing, a filter must be adapted over 
time to keep track of changing conditions. The book Theory and Design of 
Adaptive Filters by Treichler, Johnson, and Larimore (Wiley-Interscience, 
1987) presents the theory of adaptive filters. Although in theory, both FIR and 
IIR structures can be used as adaptive filters, the stability problems and the 
local optimum points that the IIR filters exhibit make them less attractive for 
such an application. Hence, until further research makes IIR filters a better 
choice, only the FIR filters are used in adaptive algorithms of practical applica- 
tions. 


In an adaptive FIR filter, the filtering equation takes this form:

y[n] = h[n,0] x[n] + h[n,1] x[n-1] + ... + h[n,N-1] x[n-(N-1)]


The filter coefficients are time-dependent. In a least-mean-squares (LMS)
algorithm, the coefficients are updated by an equation in this form:

h[n+1,i] = h[n,i] + β x[n-i],  i = 0, 1, ..., N-1


β is a constant for the computation. The updating of the filter coefficients can
be interleaved with the computation of the filter output so that it takes 3 cycles 
per filter tap to do both. The updated coefficients are written over the old filter 
coefficients. Example 6-8 shows the implementation of an adaptive FIR filter 
on the ’C4x. The memory organization and the positioning of the data in 
memory should follow the same rules as the above FIR filter with fixed coeffi- 
cients. 
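
The interleaving of filtering and coefficient update can be modeled in C as a single loop. This is a sketch only: the function name lms_step and the array x_hist (with x_hist[i] holding x[n-i]) are assumptions for this example, and beta stands for the constant β of the update equation. As in the update equation, each output term uses the coefficient value from before its update.

```c
#include <assert.h>

/* One LMS step (hypothetical helper). Returns y[n] computed with the
   current coefficients h[n,i], and overwrites each coefficient with
   h[n+1,i] = h[n,i] + beta * x[n-i]. x_hist[i] holds x[n-i]. */
float lms_step(float *h, const float *x_hist, int n_taps, float beta)
{
    float y = 0.0f;
    int i;

    assert(n_taps > 0);
    for (i = 0; i < n_taps; ++i) {
        y    += h[i] * x_hist[i];     /* filter with the old coefficient */
        h[i] += beta * x_hist[i];     /* then update it in place         */
    }
    return y;
}
```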



Example 6-8. Adaptive FIR Filter (LMS Algorithm)


*
*  TITLE ADAPTIVE FIR FILTER (LMS ALGORITHM)
*
*  SUBROUTINE LMS
*
*  LMS == LMS ADAPTIVE FILTER
*
*  EQUATIONS: y(n) = h(n,0)*x(n) + h(n,1)*x(n-1) + ...
*                    + h(n,N-1)*x(n-(N-1))
*
*             FOR (i = 0; i < N; i++)
*               h(n+1,i) = h(n,i) + tmuerr * x(n-i)
*
*  TYPICAL CALLING SEQUENCE:
*
*       load   R4
*       load   AR0
*       LAJU   LMS
*       load   AR1
*       load   RC
*       load   BK
*
*  ARGUMENT ASSIGNMENTS:
*
*  ARGUMENT | FUNCTION
*  ---------+----------------------------
*  R4       | SCALE FACTOR (2 * mu * err)
*  AR0      | ADDRESS OF h(n,N-1)
*  AR1      | ADDRESS OF x(n-(N-1))
*  RC       | LENGTH OF FILTER - 2 (N-2)
*  BK       | LENGTH OF FILTER (N)
*
*  REGISTERS USED AS INPUT: R4, AR0, AR1, RC, BK
*  REGISTERS MODIFIED: R0, R1, R2, AR0, AR1, RC
*  REGISTER CONTAINING RESULT: R0
*
*  BENCHMARKS: CYCLES: 4 + 3N (not including subroutine overhead)
*              PROGRAM SIZE: 9 words (not including subroutine overhead)
*
*  SETUP (i = 0)
*
        .global LMS
*
LMS     RPTBD  LOOP                   ;Set up the delayed repeat block
*  Initialize R0:
        MPYF3  *AR0,*AR1,R0           ;h(n,N-1) * x(n-(N-1)) -> R0
||      SUBF3  R2,R2,R2               ;Initialize R2
*  Initialize R1:
        MPYF3  *AR1++(1)%,R4,R1       ;x(n-(N-1)) * tmuerr -> R1
        ADDF3  *AR0++(1),R1,R1        ;h(n,N-1) + x(n-(N-1)) * tmuerr -> R1
*
*  FILTER AND UPDATE (1 <= i < N)
*
*  Filter:
        MPYF3  *AR0--(1),*AR1,R0      ;h(n,N-1-i) * x(n-(N-1-i)) -> R0
||      ADDF3  R0,R2,R2               ;Multiply and add operation
*  Update:
        MPYF3  *AR1++(1)%,R4,R1       ;x(n-(N-1-i)) * tmuerr -> R1
||      STF    R1,*AR0++(1)           ;R1 -> h(n+1,N-1-(i-1))
*
LOOP    ADDF3  *AR0++(1),R1,R1        ;h(n,N-1-i) + x(n-(N-1-i)) * tmuerr -> R1
*
        BUD    R11                    ;Delayed return
        ADDF3  R0,R2,R0               ;Add last product
        STF    R1,*-AR0(1)            ;h(n,0) + x(n) * tmuerr -> h(n+1,0)
        NOP
*
*  end
*
        .end
6.3 Lattice Filters


The lattice form is an alternative way of implementing digital filters; it has appli- 
cations in speech processing, spectral estimation, and other areas. In this dis- 
cussion, the notation and terminology from speech processing applications 
are used. 


If H(z) is the transfer function of a digital filter that has only poles, A(z) = 1/H(z) 
will be a filter having only zeros, and it will be called the inverse filter. The in- 
verse lattice filter is shown in Figure 6-4. These equations describe the filter 
in mathematical terms: 


f(i,n) = f(i-1,n) + k(i) b(i-1,n-1)
b(i,n) = b(i-1,n-1) + k(i) f(i-1,n)

Initial conditions:

f(0,n) = b(0,n) = x(n)

Final conditions:

y(n) = f(p,n)


In the above equations, f(i,n) is the forward error, b(i,n) is the backward error,
k(i) is the i-th reflection coefficient, x(n) is the input, and y(n) is the output signal.
The order of the filter (that is, the number of stages) is p. In the linear predictive 
coding (LPC) method of speech processing, the inverse lattice filter is used 
during analysis, and the (forward) lattice filter is used during speech synthesis. 
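
The recursion of the inverse lattice filter can be written directly in C. The sketch below is a reference model under stated assumptions: the function name lattice_inverse_step is hypothetical, k[i-1] holds the reflection coefficient k(i), and b_prev[i] holds b(i,n-1) for i = 0, ..., p-1 between calls.

```c
#include <assert.h>

/* One sample of the inverse (all-zero) lattice filter of order p
   (hypothetical helper). k[i-1] is k(i); b_prev[i] is b(i,n-1). */
float lattice_inverse_step(const float *k, float *b_prev, int p, float x)
{
    float f = x;    /* f(0,n) */
    float b = x;    /* b(0,n) */
    int i;

    assert(p > 0);
    for (i = 1; i <= p; ++i) {
        float b_old = b_prev[i - 1];           /* b(i-1,n-1)            */
        float f_new = f + k[i - 1] * b_old;    /* f(i,n)                */
        float b_new = b_old + k[i - 1] * f;    /* b(i,n)                */

        b_prev[i - 1] = b;                     /* save b(i-1,n) for n+1 */
        f = f_new;
        b = b_new;
    }
    return f;       /* y(n) = f(p,n) */
}
```

For p = 2 with k(1) = 0.5 and k(2) = 0.25, a unit impulse produces 1, 0.625, 0.25, which are the coefficients of the corresponding all-zero filter A(z).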


Figure 6-4. Structure of the Inverse Lattice Filter


Figure 6—5 shows the data memory organization of the inverse lattice filter on 
the ’C40. 


Figure 6-5. Data Memory Organization for Inverse Lattice Filters

[Figure: two memory columns, one holding the reflection coefficients and one
holding the backward propagation terms.]


Example 6-9. Inverse Lattice Filter


*
*  TITLE INVERSE LATTICE FILTER
*
*  SUBROUTINE LATINV
*
*  LATINV == LATTICE FILTER (LPC INVERSE FILTER - ANALYSIS)
*
*  TYPICAL CALLING SEQUENCE:
*
*       load   R2
*       LAJU   LATINV
*       load   AR0
*       load   AR1
*       load   RC
*
*  ARGUMENT ASSIGNMENTS:
*
*  ARGUMENT | FUNCTION
*  ---------+---------------------------------------------------
*  R2       | f(0,n) = x(n)
*  AR0      | ADDRESS OF FILTER COEFFICIENTS (k(1))
*  AR1      | ADDRESS OF BACKWARD PROPAGATION VALUES (b(0,n-1))
*  RC       | RC = p - 2
*
*  REGISTERS USED AS INPUT: R2, AR0, AR1, RC
*  REGISTERS MODIFIED: R0, R1, R2, R3, RS, RE, RC, AR0, AR1
*  REGISTER CONTAINING RESULT: R2 (f(p,n))
*
*  BENCHMARKS: CYCLES: 3 + 3p (not including subroutine overhead)
*              PROGRAM SIZE: 9 words (not including subroutine overhead)
*
        .global LATINV
*
*  i = 1
*
LATINV  RPTBD  LOOP                 ;Set up the delayed repeat block loop
        MPYF3  *AR0,*AR1,R0         ;k(1) * b(0,n-1) -> R0
*                                   ;Assume f(0,n) -> R2
        LDF    R2,R3                ;Put b(0,n) = f(0,n) -> R3
        MPYF3  *AR0++(1),R2,R1      ;k(1) * f(0,n) -> R1
*
*  2 <= i <= p (Repeat block loop starts here)
*
        MPYF3  *AR0,*++AR1(1),R0    ;k(i) * b(i-1,n-1) -> R0
||      ADDF3  R2,R0,R2             ;f(i-1-1,n) + k(i-1)*b(i-1-1,n-1) = f(i-1,n) -> R2
*
        ADDF3  *-AR1(1),R1,R3       ;b(i-1-1,n-1) + k(i-1)*f(i-1-1,n) = b(i-1,n) -> R3
||      STF    R3,*-AR1(1)          ;b(i-1-1,n) -> b(i-1-1,n-1)
*
LOOP    MPYF3  *AR0++(1),R2,R1      ;k(i) * f(i-1,n) -> R1
*
*  i = p + 1 (CLEANUP)
*
        BUD    R11                  ;Delayed return
        ADDF3  R2,R0,R2             ;f(p-1,n) + k(p)*b(p-1,n-1) = f(p,n) -> R2
*
        ADDF3  *AR1,R1,R3           ;b(p-1,n-1) + k(p)*f(p-1,n) = b(p,n) -> R3
||      STF    R3,*AR1              ;b(p-1,n) -> b(p-1,n-1)
        NOP
*
*  end
*
        .end


The structure of the forward lattice filter, shown in Figure 6-6, is similar
to that of the inverse filter. These equations describe the forward lattice
filter:

    f(i-1,n) = f(i,n) - k(i) b(i-1,n-1)
    b(i,n)   = b(i-1,n-1) + k(i) f(i-1,n)

Initial conditions:

    f(p,n) = x(n),  b(i,n-1) = 0  for i = 1,...,p

Final conditions:

    y(n) = f(0,n)

The data memory organization is identical to that of the inverse filter shown
in Figure 6-5. Example 6-10 shows the implementation of the lattice filter on
the ’C4x.
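
The two lattice recursions can also be sketched in C for checking results off
the target. This is a minimal reference sketch, not the ’C4x routine: the
function names, the argument order, and the 0-based arrays (k[i-1] holds k(i),
b[i] holds b(i,n-1)) are assumptions made for illustration, and the filter
order p is assumed to be at least 1.

```c
#include <assert.h>
#include <stddef.h>

/* Inverse (analysis) lattice, one input sample:
 *   f(i,n) = f(i-1,n)   + k(i) * b(i-1,n-1)
 *   b(i,n) = b(i-1,n-1) + k(i) * f(i-1,n),   f(0,n) = b(0,n) = x(n)
 * k[i-1] holds k(i); b[i] holds b(i,n-1) on entry and b(i,n) on exit,
 * for i = 0..p-1. Returns f(p,n). */
float lattice_inverse(float x, const float *k, float *b, size_t p)
{
    assert(p >= 1);
    float f = x;                  /* f(0,n) */
    float b_new = x;              /* b(0,n) */
    for (size_t i = 0; i < p; i++) {
        float b_old = b[i];       /* b(i,n-1) from the previous sample */
        b[i] = b_new;             /* keep b(i,n) for the next sample   */
        b_new = b_old + k[i] * f; /* b(i+1,n), uses f(i,n)             */
        f = f + k[i] * b_old;     /* f(i+1,n)                          */
    }
    return f;
}

/* Forward (synthesis) lattice, one input sample:
 *   f(i-1,n) = f(i,n)     - k(i) * b(i-1,n-1)
 *   b(i,n)   = b(i-1,n-1) + k(i) * f(i-1,n),   f(p,n) = x(n)
 * Same array layout as above. Returns y(n) = f(0,n). */
float lattice_forward(float x, const float *k, float *b, size_t p)
{
    assert(p >= 1);
    float f = x;                             /* f(p,n) */
    for (size_t i = p; i >= 1; i--) {
        f = f - k[i - 1] * b[i - 1];         /* f(i-1,n) */
        float b_i = b[i - 1] + k[i - 1] * f; /* b(i,n)   */
        if (i < p) {
            b[i] = b_i;                      /* becomes b(i,n-1) next time */
        }
    }
    b[0] = f;                                /* b(0,n) = f(0,n) */
    return f;
}
```

With zero initial state and the same coefficients, feeding the output of
lattice_inverse into lattice_forward should reproduce the original samples,
which gives a simple consistency check for a coefficient set.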


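Figure 6-6. Structure of the Forward Lattice Filter

[Figure: signal-flow graph. The forward path runs from x(n) = f(p,n) through
f(2,n) and f(1,n) to y(n), with gains -Kp, ..., -K2, -K1; the backward path
carries b(p,n), ..., b(2,n), b(1,n), with gains Kp, ..., K2, K1.]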
Example 6-10. Lattice Filter

*  TITLE LATTICE FILTER
*
*  SUBROUTINE LATTICE
*
*       LAJU    LATTICE
*       LOAD    AR0
*       LOAD    AR1
*       LOAD    RC
*
*  ARGUMENT ASSIGNMENTS:
*
*   ARGUMENT | FUNCTION
*   ---------+---------------------------------------------------
*   R2       | F(P,N) = E(N) = EXCITATION
*   AR0      | ADDRESS OF FILTER COEFFICIENTS (K(P))
*   AR1      | ADDRESS OF BACKWARD PROPAGATION
*            | VALUES (B(P-1,N-1))
*   RC       | RC = P - 2
*
*  REGISTERS USED AS INPUT: R2, AR0, AR1, RC
*  REGISTERS MODIFIED: R0, R1, R2, R3, RS, RE, RC, AR0, AR1
*  REGISTER CONTAINING RESULT: R2 (f(0,n))
*
*  BENCHMARKS: CYCLES: 1 + 5P (not including subroutine overhead)
*              PROGRAM SIZE: 11 words (not including subroutine overhead)
*
        .global LATTICE
*
LATTICE RPTBD   LOOP                  ;Setup the delayed repeat block loop
        MPYF3   *AR0,*AR1,R0          ;K(P) * B(P-1,N-1) -> R0
        SUBF3   R0,R2,R2              ;Assume F(P,N) -> R2
        NOP                           ;F(P,N)-K(P)*B(P-1,N-1)
                                      ;= F(P-1,N) -> R2
*
*  2 <= I <= P (Repeat block loop start here)
*
        MPYF3   *AR0,R2,R1            ;K(I) * F(I-1,N) -> R1
        MPYF3   *--AR0(1),*-AR1(1),R0 ;K(I-1) *
                                      ;B(I-1-1,N-1) -> R0
        ADDF3   *AR1--(1),R1,R3       ;B(I-1,N-1) + K(I)*F(I-1,N)
*                                     ;= B(I,N) -> R3
        STF     R3,*+AR1(2)           ;B(I,N) -> B(I,N-1)
LOOP    SUBF3   R0,R2,R2              ;F(I-1,N)-K(I-1)
                                      ;*B(I-1-1,N-1)
*                                     ;= F(I-1-1,N) -> R2
*
*  I = 1 (CLEANUP)
*
        BUD     R11                   ;Delayed return
        MPYF    *AR0,R2,R1            ;K(1) * F(0,N) -> R1
        ADDF3   *AR1,R1,R3            ;B(0,N-1) + K(1)*F(0,N)
*                                     ;= B(1,N) -> R3
        STF     R3,*+AR1(1)           ;B(1,N) -> B(1,N-1)
||      STF     R2,*AR1               ;F(0,N) -> B(0,N-1)
*
*  end
*
        .end
6.4 Matrix-Vector Multiplication


In matrix-vector multiplication, a K x N matrix of elements m(i,j), having K
rows and N columns, is multiplied by an N x 1 vector to produce a K x 1
result. The multiplier vector has elements v(j), and the product vector has
elements p(i). Each product-vector element is computed by the following
expression:

    p(i) = m(i,0) v(0) + m(i,1) v(1) + ... + m(i,N-1) v(N-1)    i = 0,1,...,K-1

This is essentially a dot product, and the matrix-vector multiplication
contains, as a special case, the dot product presented in Example 2-1 on
page 2-3 and Example 2-2 on page 2-5. In pseudo-C format, the computation of
the matrix multiplication is expressed by

    for (i = 0; i < K; i++) {
        p(i) = 0
        for (j = 0; j < N; j++)
            p(i) = p(i) + m(i,j) * v(j)
    }

Figure 6-7 shows the data memory organization for matrix-vector
multiplication, and Example 6-11 shows the ’C4x assembly code that implements
it. Note that in Example 6-11, K (number of rows) should be greater than 0,
and N (number of columns) should be greater than 1.
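
The pseudo-C above can be turned into a small compilable reference routine,
which is handy for checking the assembly results on a host. This is only a
sketch: the function name mat_vec_mul and the row-major layout of m (row i
starts at m[i * cols]) are assumptions chosen to match consecutive storage of
each matrix row.

```c
#include <assert.h>
#include <stddef.h>

/* p(i) = sum over j of m(i,j) * v(j), for a rows x cols matrix stored
 * row by row in consecutive memory. */
void mat_vec_mul(const float *m, const float *v, float *p,
                 size_t rows, size_t cols)
{
    assert(rows > 0 && cols > 0);
    for (size_t i = 0; i < rows; i++) {
        float acc = 0.0f;                     /* p(i) = 0 */
        for (size_t j = 0; j < cols; j++) {
            acc += m[i * cols + j] * v[j];    /* p(i) += m(i,j) * v(j) */
        }
        p[i] = acc;
    }
}
```

For example, a 2 x 3 matrix {1,2,3; 4,5,6} multiplied by the vector
{1,0,-1} gives the product vector {-2,-2}.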


Figure 6-7. Data Memory Organization for Matrix-Vector Multiplication

[Figure: three memory columns running from low address to high address:
matrix storage, input vector storage (ending at v(N-1)), and result vector
storage (ending at p(K-1)).]
Example 6-11. Matrix Times a Vector Multiplication

*  TITLE MATRIX TIMES A VECTOR MULTIPLICATION
*
*  SUBROUTINE MAT
*
*  MAT == MATRIX TIMES A VECTOR OPERATION
*
*  TYPICAL CALLING SEQUENCE:
*
*       load    AR0
*       load    AR1
*       load    AR2
*       load    AR3
*       load    R1
*       CALL    MAT
*
*  ARGUMENT ASSIGNMENTS:
*
*   ARGUMENT | FUNCTION
*   ---------+---------------------------------------------------
*   AR0      | ADDRESS OF M(0,0)
*   AR1      | ADDRESS OF V(0)
*   AR2      | ADDRESS OF P(0)
*   AR3      | NUMBER OF ROWS - 1 (K-1)
*   RC       | NUMBER OF COLUMNS - 2 (N-2)
*
*  REGISTERS USED AS INPUT: AR0, AR1, AR2, AR3, RC
*  REGISTERS MODIFIED: R0, R2, AR0, AR1, AR2, AR3, IR0, RC
*
*  BENCHMARKS: CYCLES: 1 + 7K + KN = 1 + K (N + 7)
*              (not including subroutine overhead)
*
*  PROGRAM SIZE: 10 words (not including subroutine overhead)
*
        .global MAT
*
*  SETUP
*
MAT     ADDI3   RC,2,IR0                 ;IR0 = N
*
*  FOR (i = 0; i < K; i++) LOOP OVER THE ROWS.
*
ROWS    RPTBD   DOT                      ;Setup multiply a row by a column
                                         ;Set loop counter
        LDF     0.0,R2                   ;Initialize R2
        MPYF3   *AR0++(1),*AR1++(1),R0   ;m(i,0) * v(0) -> R0
        NOP
*
*  FOR (j = 1; j < N; j++) DO DOT PRODUCT OVER COLUMNS
*
DOT     MPYF3   *AR0++(1),*AR1++(1),R0   ;m(i,j) * v(j) -> R0
||      ADDF3   R0,R2,R2                 ;m(i,j-1) * v(j-1) + R2 -> R2
*
        DBD     AR3,ROWS                 ;counts the number of rows left
        ADDF    R0,R2                    ;last accumulate
        STF     R2,*AR2++(1)             ;result -> p(i)
        NOP     *--AR1(IR0)              ;set AR1 to point to v(0)
*
*  !!! DELAYED BRANCH HAPPENS HERE !!!
*
*  RETURN SEQUENCE
*
        RETS                             ;return
*
*  end
*
        .end
6.5 Fast Fourier Transforms (FFTs)


Fourier transforms are an important tool in digital signal processing
systems. The transform converts information from the time domain to the
frequency domain; the inverse Fourier transform converts it back from the
frequency domain to the time domain. Computationally efficient
implementations of the Fourier transform are known as fast Fourier transforms
(FFTs). The theory of FFTs can be found in books such as DFT/FFT and
Convolution Algorithms by C.S. Burrus and T.W. Parks (John Wiley, 1985) and
Digital Signal Processing Applications With the TMS320 Family.


The ’C4x features that support efficient implementation of numerically
intensive algorithms are particularly well suited to FFTs. The high speed of
the ’C4x (40-ns cycle time) makes real-time algorithms easier to implement,
while the floating-point capability eliminates the problems associated with
dynamic range. The powerful indexing scheme in indirect addressing
facilitates access to FFT butterfly legs that have different spans. The
repeat block implemented by the RPTB or RPTBD instruction reduces the looping
overhead in algorithms that depend heavily on loops (such as FFTs), giving
the efficiency of in-line coding with the form of a loop. When the input is
in regular order, the FFT output is in scrambled (bit-reversed) order and
must be restored to the proper order. This rearrangement does not require
extra cycles: the device has a special form of indirect addressing
(bit-reversed addressing mode) that can be used when the FFT output is
needed.


The ’C4x can implement the bit-reversed addressing mode on either the CPU
or the DMA. This mode makes it possible to access the FFT output in the
proper order. If a DMA transfer with bit-reversed addressing mode is used,
there is no overhead for data input and output.
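
To make the reordering concrete, the sketch below shows in C what
bit-reversed access accomplishes when FFT results are copied to a separate
destination buffer. It is a host-side illustration only: the helper names
bit_reverse and unscramble are hypothetical, the data is assumed to be in
Re-Im-Re-Im format, and the source and destination buffers are assumed not to
overlap. On the ’C4x the same effect comes from the addressing hardware
rather than from explicit index arithmetic.

```c
#include <assert.h>

/* Reverse the lowest `bits` bits of `index` (for example 001 -> 100). */
unsigned bit_reverse(unsigned index, unsigned bits)
{
    assert(bits <= 8u * sizeof(unsigned));
    unsigned reversed = 0;
    for (unsigned b = 0; b < bits; b++) {
        reversed = (reversed << 1) | (index & 1u);
        index >>= 1;
    }
    return reversed;
}

/* Copy 2^log2n complex points (Re-Im pairs) from bit-reversed order in
 * src to natural order in dst: dst[i] = src[bit_reverse(i)]. */
void unscramble(const float *src, float *dst, unsigned log2n)
{
    unsigned n = 1u << log2n;
    for (unsigned i = 0; i < n; i++) {
        unsigned r = bit_reverse(i, log2n);
        dst[2 * i]     = src[2 * r];       /* real part      */
        dst[2 * i + 1] = src[2 * r + 1];   /* imaginary part */
    }
}
```

Because bit reversal is its own inverse, applying unscramble twice returns
the data to its starting order.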


There are several types of FFT examples in this section:

- Radix-2 and radix-4 algorithms, depending on the size of the FFT butterfly
- Decimation in time or frequency (DIT or DIF)
- Complex or real FFTs
- FFTs of different lengths, etc.

The following C-callable FFT code examples are provided in this section:

- Complex radix-2 DIF FFT: subsection 6.5.1
- Complex radix-4 DIF FFT: subsection 6.5.2
- Faster complex radix-2 DIT FFT: subsection 6.5.3
- Real radix-2 DIF FFT: subsection 6.5.4
Code for these different FFTs can be found on the DSP Bulletin Board Service
(under the filename C40FFT.EXE). This file includes code, input data and sine
table examples, and batch files for compiling and linking. For instructions
on how to access the BBS, see subsection 10.1.3, The Bulletin Board Service
(BBS). To use this FFT code, you need to perform two steps:

- Provide a sine table in the format required by the program. This sine table
  is FFT-size specific, with the exception of the sine table required for the
  complex radix-2 DIT and the real radix-2 DIF FFT programs (as noted in
  Example 6-18).

- Align the input data buffer on an (n+1)-bit memory boundary; that is, the
  n+1 LSBs of the input buffer base address must be zero
  (n = log2 FFT_SIZE).
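
The alignment rule can be checked with a few lines of C before a buffer
address is handed to the FFT routine. This is a sketch under stated
assumptions: the helper names are hypothetical, the address is treated as a
32-bit word address, and FFT_SIZE is assumed to be a power of two.

```c
#include <assert.h>
#include <stdint.h>

/* n = log2(fft_size) for a power-of-two FFT size. */
unsigned log2_fft_size(unsigned fft_size)
{
    assert(fft_size != 0 && (fft_size & (fft_size - 1u)) == 0);
    assert(fft_size <= (1u << 30));
    unsigned n = 0;
    while ((1u << n) < fft_size) {
        n++;
    }
    return n;
}

/* Returns 1 when the n+1 least significant bits of the buffer base
 * address are zero, 0 otherwise. */
int is_fft_buffer_aligned(uint32_t word_address, unsigned fft_size)
{
    unsigned n = log2_fft_size(fft_size);
    uint32_t mask = ((uint32_t)1 << (n + 1u)) - 1u;
    return (word_address & mask) == 0;
}
```

For a 64-point complex FFT (n = 6), the seven LSBs must be zero, so a base
address ending in 0x80 qualifies while one ending in 0x40 does not.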


For most applications, the ’C4x quickly executes FFT lengths of up to 1024
points (complex) or 2048 points (real) because it can do so almost entirely
in on-chip memory.


For FFTs larger than 1024 points (complex), see the application report,
Parallel 1-D FFT Implementation with the TMS320C4x DSPs, in the book Parallel
Processing Applications with the TMS320C4x DSP (literature number SPRA031).
This application note covers unprocessed partitioned FFT implementation for
large FFTs. The source code is also available on the TI DSP Bulletin Board
(under the filename C40PFFT.EXE).
6.5.1 Complex Radix-2 DIF FFT


Example 6-12 shows a simple implementation of a complex radix-2 DIF FFT on
the ’C4x. The code is generic and can be used with any FFT length. However,
a complete FFT implementation also needs a table of twiddle factors
(sines/cosines), and this table depends on the size of the transform. To
retain the generic form of Example 6-12, the table with the twiddle factors
(containing 1-1/4 complete cycles of a sine) is presented separately in
Example 6-13 for the case of a 64-point FFT. A full cycle of the sine should
have a number of points equal to the FFT size. If the table with the twiddle
factors and the FFT code are kept in separate files, they should be connected
at link time.
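
The reason for the extra quarter cycle is that the cosine can be read from
the same table at an offset of FFT_SIZE/4 entries from the sine. The C sketch
below illustrates this lookup together with the radix-2 DIF butterfly
equations given in the listing comments (AR' = AR + BR, BR' = (AR-BR)*COS +
(AI-BI)*SIN, and so on). The type cplx, the function name, and the argument
layout are assumptions for illustration; sine_table is assumed to hold
5*FFT_SIZE/4 samples of sin(2*pi*i/FFT_SIZE).

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float re, im; } cplx;

/* One in-place radix-2 DIF butterfly on points a and b.
 * sin is read at twiddle_index, cos at twiddle_index + fft_size/4. */
void dif_butterfly(cplx *a, cplx *b, const float *sine_table,
                   size_t fft_size, size_t twiddle_index)
{
    assert(fft_size % 4 == 0 && twiddle_index < fft_size);
    float s = sine_table[twiddle_index];
    float c = sine_table[twiddle_index + fft_size / 4];
    float dr = a->re - b->re;     /* AR - BR */
    float di = a->im - b->im;     /* AI - BI */
    a->re += b->re;               /* AR' = AR + BR */
    a->im += b->im;               /* AI' = AI + BI */
    b->re = dr * c + di * s;      /* BR' = (AR-BR)*COS + (AI-BI)*SIN */
    b->im = di * c - dr * s;      /* BI' = (AI-BI)*COS - (AR-BR)*SIN */
}
```

With a 4-point table {0, 1, 0, -1, 0}, twiddle index 1 yields SIN = 1 and
COS = 0, which makes the butterfly easy to verify by hand.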
Example 6-12. Complex Radix-2 DIF FFT

******************************************************************************
*
*  FILENAME    : CR2DIF.ASM
*  DESCRIPTION : COMPLEX, RADIX-2 DIF FFT FOR TMS320C40 (C callable)
*  DATE        : 6/29/93
*  VERSION     : 4.0
*
******************************************************************************
*
*  VERSION  DATE     COMMENTS
*  -------  ----     --------
*  1.0      10/87    PANNOS PAPAMICHALIS (TI Houston) Original Release
*  2.0      1/91     DANIEL CHEN (TI Houston): C40 porting
*  3.0      7/1/92   ROSEMARIE PIEDRA (TI Houston): made it C-callable
*  4.0      6/29/93  ROSEMARIE PIEDRA (TI Houston): added support for
*                    in-place bit reversing
*
******************************************************************************
*
*  SYNOPSIS: int cr2dif(SOURCE_ADDR,FFT_SIZE,LOGFFT,DST_ADDR)
*                        ar2         r2       r3     rc
*
*      float *SOURCE_ADDR   ;input address
*      int   FFT_SIZE       ;64, 128, 256, 512, 1024, ...
*      int   LOGFFT         ;log (base 2) of FFT_SIZE
*      float *DST_ADDR      ;destination address
*
*  - The computation is done in-place.
*  - Sections to be allocated in linker command file:
*        .ffttxt : FFT code
*        .fftdat : FFT data
*  - If SOURCE_ADDR=DST_ADDR, then in-place bit reversing is performed
*
******************************************************************************
*
*  DESCRIPTION:
*
*  Generic program for a radix-2 DIF FFT computation using the TMS320C4x
*  family. The computation is done in-place and the result is bit-reversed.
*  The program is from the Burrus and Parks book, p. 111.
*  The input data array is 2*FFT_SIZE-long with real and imaginary data in
*  consecutive memory locations: Re-Im-Re-Im
*
*  The twiddle factors are supplied in a table put in a section with a global
*  label _SINE pointing to the beginning of the table. This data is included
*  in a separate file to preserve the generic nature of the program. The sine
*  table size is (5*FFT_SIZE)/4.
*
*  Note: Sections needed in the linker command file: .ffttxt : FFT code
*                                                    .fftdat : FFT data
*
******************************************************************************
*
*      AR + j AI ------------------------ AR' + j AI'
*                  \        /  +
*                   \      /
*                    \    /
*                     \  /
*                      \/
*                      /\
*                     /  \
*                    /    \
*                   /      \  +
*      BR + j BI ---- COS - j SIN ------- BR' + j BI'
*                             -
*
*      AR' = AR + BR
*      AI' = AI + BI
*      BR' = (AR-BR)*COS + (AI-BI)*SIN
*      BI' = (AI-BI)*COS - (AR-BR)*SIN
*
******************************************************************************
*
        .globl  _SINE            ;Address of sine/cosine table
        .globl  _cr2dif          ;Entry point for execution
        .globl  STARTB,ENDB      ;starting/ending point for benchmarks
        .sect   ".fftdat"
SINTAB  .word   _SINE
OUTPUTP .space  1
FFTSIZE .space  1
        .sect   ".ffttxt"
_cr2dif:
        LDI     SP,AR0
        PUSH    DP
        PUSH    R4               ;Save dedicated registers
        PUSH    R5
        PUSH    R6               ;lower 32 bits
        PUSHF   R6               ;upper 32 bits
        PUSH    AR4
        PUSH    AR5
        PUSH    AR6
        PUSH    R8
        LDP     SINTAB
        .if     .REGPARM == 0    ;stack is used for parameter passing
        LDI     *-AR0(1),AR2     ;points input data
        LDI     *-AR0(2),R10     ;R10=N
        LDI     *-AR0(3),R9      ;R9 holds the remain stage number
        LDI     *-AR0(4),RC      ;points where FFT result should move to
        .else                    ;registers are used for parameter passing
        LDI     R2,R10
        LDI     R3,R9
        .endif
        STI     RC,@OUTPUTP
        STI     R10,@FFTSIZE
STARTB:
        LDI     1,R8             ;Initialize repeat counter of first loop
        LSH3    1,R10,IR0        ;IR0=2*N1 (because of real/imag)
        LSH3    -2,R10,IR1       ;IR1=N/4, pointer for SIN/COS table
        LDI     1,AR5            ;Initialize IE index (AR5=IE)
        LSH     1,R10
        SUBI3   1,R8,RC          ;RC should be one less than desired #
*
*  Outer loop
*
LOOP:
        RPTBD   BLK1             ;Setup for first loop
        LSH     -1,R10           ;N2=N2/2
        LDI     AR2,AR0          ;AR0 points to X(I)
        ADDI    R10,AR0,AR6      ;AR6 points to X(L)
*
*  First loop
*
        ADDF    *AR0,*AR6,R0
        SUBF    *AR6++,*AR0++,R1
        ADDF    *AR6,*AR0,R2
        SUBF    *AR6,*AR0,R3
        STF     R2,*AR0--
||      STF     R3,*AR6--
BLK1    STF     R0,*AR0++(IR0)   ;X(I)=R0 and...
||      STF     R1,*AR6++(IR0)   ;X(L)=R1 and AR0,2 = AR0,2 + 2*n
*
*  If this is the last stage, you are done
*
        SUBI    1,R9
        BZD     ENDB
*
*  main inner loop
*
        LDI     2,AR1            ;Init loop counter for inner loop
        LDI     @SINTAB,AR4      ;Initialize IA index (AR4=IA)
        ADDI    AR5,AR4          ;IA=IA+IE;AR4 points to cosine
        ADDI    AR2,AR1,AR0      ;(X(I),Y(I)) pointer
        SUBI    1,R8,RC          ;RC should be one less than desired #
INLOP:
        RPTBD   BLK2             ;Setup for second loop
        ADDI    R10,AR0,AR6      ;(X(L),Y(L)) pointer
        ADDI    2,AR1
        LDF     *AR4,R6          ;R6=SIN
*
*  Second loop
*
        SUBF    *AR6,*AR0,R2     ;R2=X(I)-X(L)
        SUBF    *+AR6,*+AR0,R1   ;R1=Y(I)-Y(L)
        MPYF    R2,R6,R0         ;R0=R2*SIN and...
||      ADDF    *+AR6,*+AR0,R3   ;R3=Y(I)+Y(L)
        MPYF    R1,*+AR4(IR1),R3 ;R3=R1*COS and...
||      STF     R3,*+AR0         ;Y(I)=Y(I)+Y(L)
        SUBF    R0,R3,R4         ;R4=R1*COS-R2*SIN
        MPYF    R1,R6,R0         ;R0=R1*SIN and...
||      ADDF    *AR6,*AR0,R3     ;R3=X(I)+X(L)
        MPYF    R2,*+AR4(IR1),R3 ;R3=R2*COS and...
||      STF     R3,*AR0++(IR0)   ;X(I)=X(I)+X(L) and AR0=AR0+2*N1
        ADDF    R0,R3,R5         ;R5=R2*COS+R1*SIN
BLK2    STF     R5,*AR6++(IR0)   ;X(L)=R2*COS+R1*SIN, incr AR6 and...
||      STF     R4,*+AR6         ;Y(L)=R1*COS-R2*SIN
        CMPI    R10,AR1
        BNEAF   INLOP            ;Loop back to the inner loop
        ADDI    AR5,AR4          ;IA=IA+IE;AR4 points to cosine
        ADDI    AR2,AR1,AR0      ;(X(I),Y(I)) pointer
        SUBI    1,R8,RC
        LSH     1,R8             ;Increment loop counter for next time
        BRD     LOOP             ;Next FFT stage (delayed)
        LSH     1,AR5            ;IE=2*IE
        LDI     R10,IR0          ;N1=N2
        SUBI3   1,R8,RC
ENDB:
*
******************************************************************************
*-------------  BIT REVERSAL  ------------------------------------------------*
*  This bit-reversal section assumes input and output in Re-Im-Re-Im format   *
******************************************************************************
        cmpi    @OUTPUTP,ar2
        beqd    INPLACE
        nop
        ldi     @FFTSIZE,ir0     ;ir0 = FFT_SIZE
        subi    2,ir0,rc         ;rc = FFT_SIZE-2
                                 ;SRC different from DST
                                 ;ar2 = SRC_ADDR
        rptbd   BITRV
        ldi     2,ir1            ;ir1 = 2
        ldi     @OUTPUTP,ar1     ;ar1 = DST_ADDR
        ldf     *+ar2(1),r0      ;read first Im value
        ldf     *ar2++(ir0)b,r1
||      stf     r0,*+ar1(1)
BITRV   ldf     *+ar2(1),r0
||      stf     r1,*ar1++(ir1)
        bud     END
        ldf     *ar2++(ir0)b,r1
||      stf     r0,*+ar1(1)
        nop
        stf     r1,*ar1
INPLACE
        rptbd   BITRV2           ;in place bit reversing
        ldi     ar2,ar1
        nop     *++ar1(2)
        nop     *ar2++(ir0)b
        cmpi    ar1,ar2
        bgeat   CONT
        ldf     *ar1,r0
||      ldf     *ar2,r1
        stf     r0,*ar2
||      stf     r1,*ar1
        ldf     *+ar1(1),r0
||      ldf     *+ar2(1),r1
        stf     r0,*+ar2(1)
||      stf     r1,*+ar1(1)
CONT    nop     *++ar1(2)
BITRV2  nop     *ar2++(ir0)b
*
*  Return to C environment.
*
END:    POP     R8
        POP     AR6              ;Restore the register values and return
        POP     AR5
        POP     AR4
        POPF    R6
        POP     R6
        POP     R5
        POP     R4
        POP     DP
        RETS
        .end
Example 6-13. Table With Twiddle Factors for a 64-Point FFT

******************************************************************************
*
*  TITLE  TABLE WITH TWIDDLE FACTORS FOR A 64-POINT FFT
*
*  FILE TO BE LINKED WITH THE SOURCE CODE FOR A 64-POINT, RADIX-2
*  DIF COMPLEX FFT OR A RADIX-4 DIF COMPLEX FFT
*
*  SINE TABLE LENGTH = 5*FFTSIZE/4
*
******************************************************************************
        .globl  _SINE
        .sect   ".sintab"
_SINE
        .float   0.000000,  0.098017,  0.195090,  0.290285
        .float   0.382683,  0.471397,  0.555570,  0.634393
        .float   0.707107,  0.773010,  0.831470,  0.881921
        .float   0.923880,  0.956940,  0.980785,  0.995185
_COSINE
        .float   1.000000,  0.995185,  0.980785,  0.956940
        .float   0.923880,  0.881921,  0.831470,  0.773010
        .float   0.707107,  0.634393,  0.555570,  0.471397
        .float   0.382683,  0.290285,  0.195090,  0.098017
        .float   0.000000, -0.098017, -0.195090, -0.290285
        .float  -0.382683, -0.471397, -0.555570, -0.634393
        .float  -0.707107, -0.773010, -0.831470, -0.881921
        .float  -0.923880, -0.956940, -0.980785, -0.995185
        .float  -1.000000, -0.995185, -0.980785, -0.956940
        .float  -0.923880, -0.881921, -0.831470, -0.773010
        .float  -0.707107, -0.634393, -0.555570, -0.471397
        .float  -0.382683, -0.290285, -0.195090, -0.098017
        .float   0.000000,  0.098017,  0.195090,  0.290285
        .float   0.382683,  0.471397,  0.555570,  0.634393
        .float   0.707107,  0.773010,  0.831470,  0.881921
        .float   0.923880,  0.956940,  0.980785,  0.995185


6.5.2 Complex Radix-4 DIF FFT


The radix-2 algorithm has tutorial value because it makes it relatively easy
to understand how the FFT algorithm functions. However, radix-4
implementations can increase the speed of execution by reducing the overall
arithmetic required. Example 6-14 shows the generic implementation of a
complex DIF FFT in radix-4. A companion table like the one in Example 6-13
should be used to provide the twiddle factors.
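
Part of the arithmetic saving comes from the fact that the core of a radix-4
butterfly is a 4-point DFT whose internal twiddle factors are 1, -j, -1, and
j, so it needs only additions and subtractions. The C sketch below shows that
core in natural output order and without the inter-stage twiddle multiplies.
It is a generic illustration, not the register schedule of Example 6-14
(which also interchanges the two middle branches during storage); the type
and function names are assumptions.

```c
typedef struct { float re, im; } cplx;

/* In-place 4-point DFT with W = exp(-j*2*pi/4) = -j:
 *   X0 = (x0 + x2) + (x1 + x3)
 *   X1 = (x0 - x2) - j*(x1 - x3)
 *   X2 = (x0 + x2) - (x1 + x3)
 *   X3 = (x0 - x2) + j*(x1 - x3)
 * No multiplications are required. */
void radix4_butterfly(cplx x[4])
{
    cplx s02 = { x[0].re + x[2].re, x[0].im + x[2].im };
    cplx d02 = { x[0].re - x[2].re, x[0].im - x[2].im };
    cplx s13 = { x[1].re + x[3].re, x[1].im + x[3].im };
    cplx d13 = { x[1].re - x[3].re, x[1].im - x[3].im };

    x[0].re = s02.re + s13.re;  x[0].im = s02.im + s13.im;
    x[1].re = d02.re + d13.im;  x[1].im = d02.im - d13.re;  /* -j * d13 */
    x[2].re = s02.re - s13.re;  x[2].im = s02.im - s13.im;
    x[3].re = d02.re - d13.im;  x[3].im = d02.im + d13.re;  /* +j * d13 */
}
```

A unit impulse at index 1 transforms to the sequence 1, -j, -1, j, which is a
quick way to confirm the sign conventions.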
Example 6-14. Complex Radix-4 DIF FFT

******************************************************************************
*
*  FILENAME    : CR4DIF.ASM
*  DESCRIPTION : COMPLEX, RADIX-4 DIF FFT FOR TMS320C40 (C callable)
*  DATE        : 6/29/93
*  VERSION     : 4.0
*
******************************************************************************
*
*  VERSION  DATE     COMMENTS
*  -------  ----     --------
*  1.0      10/87    PANNOS PAPAMICHALIS (TI Houston)
*                    Original Release
*  2.0      1/91     DANIEL CHEN (TI Houston): C40 porting
*  3.0      7/1/91   ROSEMARIE PIEDRA (TI Houston): made it C-callable
*  4.0      6/29/93  ROSEMARIE PIEDRA (TI Houston): added support for
*                    in-place bit reversing.
*
******************************************************************************
*
*  SYNOPSIS: int cr4dif(SOURCE_ADDR,FFT_SIZE,LOGFFT,DST_ADDR)
*                        ar2         r2       r3     rc
*
*      float *SOURCE_ADDR   ;input address
*      int   FFT_SIZE       ;64, 256, 1024, ...
*      int   LOGFFT         ;log (base 4) of FFT_SIZE
*      float *DST_ADDR      ;destination address
*
*  - The computation is done in-place.
*  - Sections to be allocated in linker command file: .ffttxt : FFT code
*                                                     .fftdat : FFT data
*  - If SOURCE_ADDR=DST_ADDR, then in-place bit reversing is performed
*
******************************************************************************
*
*  DESCRIPTION:
*
*  Generic program for a radix-4 DIF FFT computation using the TMS320C4x
*  family. The computation is done in-place and the result is bit-reversed.
*  The program is taken from the Burrus and Parks book, p. 117.
*
*  The input data array is 2*FFT_SIZE-long with real and imaginary data
*  in consecutive memory locations: Re-Im-Re-Im
*
*  The twiddle factors are supplied in a table put in a section
*  with a global label _SINE pointing to the beginning of the table.
*  This data is included in a separate file to preserve the generic
*  nature of the program. The sine table size is (5*FFT_SIZE)/4.
*
*  In order to have the final results in bit-reversed order, the two
*  middle branches of the radix-4 butterfly are interchanged during
*  storage. Note the difference when comparing with the program on p. 117
*  of the Burrus and Parks book.
*
*  Note: Sections needed in the linker command file: .ffttxt : FFT code
*                                                    .fftdat : FFT data
*
******************************************************************************
*
*  WARNING:
*
*  For optimization purposes, LDF *+AR1,R0 (see **1**) will fetch memory
*  outside the input buffer range during the "first loop" execution (RC=0).
*  Even though the read value (R0) is not used in the code, this could cause
*  a halt situation if AR1 points to a no-ready external memory
*
******************************************************************************
        .globl  _SINE            ;Address of sine/cosine table
        .globl  _cr4dif          ;Entry point for execution
        .globl  STARTB,ENDB      ;starting/ending point for benchmarks
        .sect   ".fftdat"
FFTSIZ  .space  1
SINTAB  .word   _SINE
SINTAB1 .word   _SINE-1
INPUTP  .space  1
OUTPUTP .space  1
        .sect   ".ffttxt"
_cr4dif:
        LDI     SP,AR0
        PUSH    DP
        PUSH    R4               ;Save dedicated registers
        PUSH    R5
        PUSH    R6               ;lower 32 bits
        PUSHF   R6               ;upper 32 bits
        PUSH    R7               ;lower 32 bits
        PUSHF   R7               ;upper 32 bits
        PUSH    AR3
        PUSH    AR4
        PUSH    AR5
        PUSH    AR6
        PUSH    AR7
        PUSH    R8
        .if     .REGPARM == 0
        LDI     *-AR0(1),AR2     ;points to input data
        LDI     *-AR0(2),R10     ;R10=N
        LDI     *-AR0(3),R9      ;R9 holds the remain stage number
        LDI     *-AR0(4),RC      ;points to where FFT result should move to
        .else
        LDI     R2,R10
        LDI     R3,R9
        .endif
        LDP     FFTSIZ           ;Command to load data page pointer
        STI     AR2,@INPUTP
        STI     RC,@OUTPUTP
        STI     R10,@FFTSIZ
STARTB:
        LDI     @FFTSIZ,BK
        LSH3    1,BK,IR0         ;IR0=2*N1 (because of real/imag)
        LSH3    -2,BK,IR1        ;IR1=N/4, pointer for SIN/COS table
        LDI     1,AR7            ;Initialize IE index
        LDI     1,R8             ;Initialize repeat counter of first loop
        ADDI    2,IR1,R9         ;R9=JT
        LSH     -1,BK            ;BK=N2
*
*  OUTER LOOP
*
LOOP:   LDI     @INPUTP,AR0      ;AR0 points to X(I)
        SUBI3   1,R8,RC          ;RC should be one less than desired #
        ADDI    BK,AR0,AR1       ;AR1 points to X(I1)
        RPTBD   BLK1             ;Setup loop BLK1
        ADDI    BK,AR1,AR2       ;AR2 points to X(I2)
        ADDI    BK,AR2,AR3       ;AR3 points to X(I3)
        LDF     *+AR1,R0         ;R0=Y(I1)
*
*  FIRST LOOP: BLK1
*
        ADDF    R0,*+AR3,R3      ;R3=Y(I1)+Y(I3)
        ADDF    *+AR0,*+AR2,R1   ;R1=Y(I)+Y(I2)
        ADDF    R3,R1,R6         ;R6=R1+R3
        SUBF    *+AR2,*+AR0,R4   ;R4=Y(I)-Y(I2)
        LDF     *AR2,R5          ;R5=X(I2)
        STF     R6,*+AR0         ;Y(I)=R1+R3
        SUBF    R3,R1            ;R1=R1-R3
        ADDF    *AR3,*AR1,R3     ;R3=X(I1)+X(I3)
        ADDF    R5,*AR0,R1       ;R1=X(I)+X(I2)
||      STF     R1,*+AR1         ;Y(I1)=R1-R3
        ADDF    R3,R1,R6         ;R6=R1+R3
        SUBF    R5,*AR0,R2       ;R2=X(I)-X(I2)
||      STF     R6,*AR0++(IR0)   ;X(I)=R1+R3
        SUBF    R3,R1            ;R1=R1-R3
        SUBF    *AR3,*AR1,R6     ;R6=X(I1)-X(I3)
        SUBF    R0,*+AR3,R3      ;-R3=Y(I1)-Y(I3)
        STF     R1,*AR1++(IR0)   ;X(I1)=R1-R3
        SUBF    R6,R4,R5         ;R5=R4-R6
        ADDF    R6,R4            ;R4=R4+R6
        STF     R5,*+AR2         ;Y(I2)=R4-R6
        STF     R4,*+AR3         ;Y(I3)=R4+R6
        SUBF    R3,R2,R5         ;R5=R2+R3
        ADDF    R3,R2            ;R2=R2-R3
        STF     R2,*AR3++(IR0)   ;X(I3)=R2+R3
BLK1    STF     R5,*AR2++(IR0)   ;X(I2)=R2-R3
||      LDF     *+AR1,R0         ;R0=Y(I1)  **1**
*
*  IF THIS IS THE LAST STAGE, YOU ARE DONE
*
        CMPI    IR1,R8
        BZD     ENDB
*
*  MAIN INNER LOOP
*
        LDI     1,R10            ;Init IA1 index
        LDI     2,R11            ;Init loop counter for inner loop
        LDI     R11,AR0
        ADDI    @INPUTP,AR0      ;(X(I),Y(I)) pointer
        ADDI    2,R11            ;Increment inner loop counter
INLOP:  ADDI    AR7,R10          ;IA1=IA1+IE
        ADDI    BK,AR0,AR1       ;(X(I1),Y(I1)) pointer
        CMPI    R9,R11           ;If LPCNT=JT, go to
        BZD     SPCL             ;special butterfly
        ADDI    BK,AR1,AR2       ;(X(I2),Y(I2)) pointer
        ADDI    BK,AR2,AR3       ;(X(I3),Y(I3)) pointer
        SUBI3   1,R8,RC          ;RC should be one less than desired #
        LDI     R10,AR4
        ADDI    @SINTAB1,AR4     ;Create cosine index AR4
        ADDI    AR4,R10,AR5
        SUBI    1,AR5            ;IA2=IA1+IA1-1
        RPTBD   BLK2             ;Setup loop BLK2
        ADDI    R10,AR5,AR6
        SUBI    1,AR6            ;IA3=IA2+IA1-1
        LDF     *+AR2,R7         ;R7=Y(I2)
*
*  SECOND LOOP: BLK2
*
        ADDF    R7,*+AR0,R3      ;R3=Y(I)+Y(I2)
        ADDF    *+AR3,*+AR1,R5   ;R5=Y(I1)+Y(I3)
        ADDF    R5,R3,R6         ;R6=R3+R5
        SUBF    R7,*+AR0,R4      ;R4=Y(I)-Y(I2)
        SUBF    R5,R3            ;R3=R3-R5
        ADDF    *AR2,*AR0,R1     ;R1=X(I)+X(I2)
        ADDF    *AR3,*AR1,R5     ;R5=X(I1)+X(I3)
        MPYF    R3,*+AR5(IR1),R6 ;R6=R3*CO2
||      STF     R6,*+AR0         ;Y(I)=R3+R5
        ADDF    R5,R1,R0         ;R0=R1+R5
        SUBF    *AR2,*AR0,R2     ;R2=X(I)-X(I2)
        SUBF    R5,R1            ;R1=R1-R5
        MPYF    R1,*AR5,R0       ;R0=R1*SI2
||      STF     R0,*AR0++(IR0)   ;X(I)=R1+R5
        SUBF    R0,R6            ;R6=R3*CO2-R1*SI2
        SUBF    *+AR3,*+AR1,R5   ;R5=Y(I1)-Y(I3)
        MPYF    R1,*+AR5(IR1),R0 ;R0=R1*CO2
||      STF     R6,*+AR1         ;Y(I1)=R3*CO2-R1*SI2
        MPYF    R3,*AR5,R6       ;R6=R3*SI2
        ADDF    R0,R6            ;R6=R1*CO2+R3*SI2
        ADDF    R5,R2,R1         ;R1=R2+R5
        SUBF    R5,R2            ;R2=R2-R5
        SUBF    *AR3,*AR1,R5     ;R5=X(I1)-X(I3)
        SUBF    R5,R4,R3         ;R3=R4-R5
        ADDF    R5,R4            ;R4=R4+R5
        MPYF    R3,*+AR4(IR1),R6 ;R6=R3*CO1
||      STF     R6,*AR1++(IR0)   ;X(I1)=R1*CO2+R3*SI2
        MPYF    R1,*AR4,R0       ;R0=R1*SI1
        SUBF    R0,R6            ;R6=R3*CO1-R1*SI1
        MPYF    R1,*+AR4(IR1),R6 ;R6=R1*CO1
||      STF     R6,*+AR2         ;Y(I2)=R3*CO1-R1*SI1
        MPYF    R3,*AR4,R0       ;R0=R3*SI1
        ADDF    R0,R6            ;R6=R1*CO1+R3*SI1
        MPYF    R4,*+AR6(IR1),R6 ;R6=R4*CO3
||      STF     R6,*AR2++(IR0)   ;X(I2)=R1*CO1+R3*SI1
        MPYF    R2,*AR6,R0       ;R0=R2*SI3
        SUBF    R0,R6            ;R6=R4*CO3-R2*SI3
        MPYF    R2,*+AR6(IR1),R6 ;R6=R2*CO3
||      STF     R6,*+AR3         ;Y(I3)=R4*CO3-R2*SI3
        MPYF    R4,*AR6,R0       ;R0=R4*SI3
        ADDF    R0,R6            ;R6=R2*CO3+R4*SI3
BLK2    STF     R6,*AR3++(IR0)   ;X(I3)=R2*CO3+R4*SI3
||      LDF     *+AR2,R7         ;Load next Y(I2)
        CMPI    R11,BK
        BPD     INLOP            ;LOOP BACK TO THE INNER LOOP
        LDI     R11,AR0
        ADDI    @INPUTP,AR0      ;(X(I),Y(I)) pointer
        ADDI    2,R11            ;Increment inner loop counter
        BRD     CONT
        LSH     2,R8             ;Increment repeat counter for next time
        LSH     2,AR7            ;IE=4*IE
        LDI     BK,IR0           ;N1=N2
*
*  SPECIAL BUTTERFLY FOR W=J
*
SPCL    RPTBD   BLK3             ;Setup loop BLK3
        LSH     -1,IR1,AR4       ;Point to SIN(45)
        ADDI    @SINTAB,AR4      ;Create cosine index AR4=CO21
        LDF     *AR2,R7          ;R7=X(I2)
*
*  SPCL LOOP: BLK3
*
        ADDF    R7,*AR0,R1       ;R1=X(I)+X(I2)
        ADDF    *+AR2,*+AR0,R3   ;R3=Y(I)+Y(I2)
        SUBF    *+AR2,*+AR0,R4   ;R4=Y(I)-Y(I2)
        ADDF    *AR3,*AR1,R5     ;R5=X(I1)+X(I3)
        SUBF    R1,R5,R6
        ADDF    R5,R1
        ADDF    *+AR3,*+AR1,R5
        SUBF    R5,R3,R0
        ADDF    R5,R3
        SUBF    R7,*AR0,R2
||      STF     R3,*+AR0
        LDF     *AR3,R7
||      STF     R1,*AR0++(IR0)
        SUBF    *+AR3,*+AR1,R3   ;R3=Y(I1)-Y(I3)
        SUBF    R7,*AR1,R1       ;R1=X(I1)-X(I3)
||      STF     R6,*+AR1         ;Y(I1)=R5-R1
        ADDF    R3,R2,R5
        SUBF    R2,R3,R2
        SUBF    R1,R4,R3
        ADDF    R1,R4
        SUBF    R5,R3,R1
        MPYF    R1,*AR4,R1
||      STF     R0,*AR1++(IR0)
        ADDF    R5,R3
        MPYF    R3,*AR4,R3
||      STF     R1,*+AR2         ;Y(I2)=(R3-R5)*CO21
        SUBF    R4,R2,R1         ;R1=R2-R4
        MPYF    R1,*AR4,R1       ;R1=R1*CO21
||      STF     R3,*AR2++(IR0)   ;X(I2)=(R3+R5)*CO21
        ADDF    R4,R2            ;R2=R2+R4
        MPYF3   R2,*AR4,R2       ;R2=R2*CO21
||      STF     R1,*+AR3         ;Y(I3)=-(R4-R2)*CO21
BLK3    LDF     *AR2,R7          ;Load next X(I2)
||      STF     R2,*AR3++(IR0)   ;X(I3)=(R4+R2)*CO21
        CMPI    R11,BK
        BPD     INLOP            ;Loop back to the inner loop
        LDI     R11,AR0
        ADDI    @INPUTP,AR0      ;(X(I),Y(I)) pointer
        ADDI    2,R11            ;Increment inner loop counter
        LSH     2,R8             ;Increment repeat counter for next time
        LSH     2,AR7            ;IE=4*IE
        LDI     BK,IR0           ;N1=N2
CONT    BRD     LOOP             ;Next FFT stage (delayed)
        LSH     -2,BK            ;N2=N2/4
        LSH3    -1,BK,R9
        ADDI    2,R9             ;JT=N2/2+2
ENDB:
******************************************************************************
*-------------  BIT REVERSAL  ------------------------------------------------*
*  This bit-reversal section assumes input and output in Re-Im-Re-Im format   *
******************************************************************************
        LDI     @INPUTP,ar0
        CMPI    @OUTPUTP,ar0
        BEQD    INPLACE
        LDI     @OUTPUTP,ar1     ;ar1=DST_ADDR
        LDI     @FFTSIZ,ir0      ;ir0=FFT_SIZE
        SUBI    2,ir0,rc         ;rc=FFT_SIZE-2
        RPTBD   bitrv1
        LDI     2,ir1            ;ir1=2
        LDF     *+ar0(1),r0      ;read first Im value
        NOP
        LDF     *ar0++(ir0)b,r1
||      STF     r0,*+ar1(1)
bitrv1  LDF     *+ar0(1),r0
||      STF     r1,*ar1++(ir1)
        BUD     END
        LDF     *ar0++(ir0)b,r1
||      STF     r0,*+ar1(1)
        NOP
        STF     r1,*ar1
INPLACE
        RPTBD   BITRV2
        NOP     *++ar1(2)
        NOP     *ar0++(ir0)b
        NOP
        CMPI    ar1,ar0
        BGEAT   CONT2
        LDF     *ar1,r0
||      LDF     *ar0,r1
        STF     r0,*ar0
||      STF     r1,*ar1
        LDF     *+ar1(1),r0
||      LDF     *+ar0(1),r1
        STF     r0,*+ar0(1)
||      STF     r1,*+ar1(1)
CONT2   NOP     *++ar1(2)
BITRV2  NOP     *ar0++(ir0)b
END:    POP     R8               ;Restore the register values and return
        POP     AR7
        POP     AR6
        POP     AR5
        POP     AR4
        POP     AR3
        POPF    R7
        POP     R7
        POPF    R6
        POP     R6
        POP     R5
        POP     R4
        POP     DP
        RETS
        .end
6.5.3 Faster Complex Radix-2 DIT FFT


Example 6-12 and Example 6-14 make it easy to understand how the FFT
algorithm functions. However, those examples are not optimized for fast
execution of the FFT. Example 6-15 shows a faster version of a radix-2 DIT
FFT algorithm. This program uses a different twiddle factor table than the
previous examples. The twiddle factors are stored in bit-reversed order and
with a table length of N/2 (N = FFT length), as shown in Example 6-16. For
instance, if the FFT length is 32, the twiddle factor table should be:


Address   Coefficient

   0      R{WN(0)}  = COS(2*PI*0/32) = 1
   1      -I{WN(0)} = SIN(2*PI*0/32) = 0
   2      R{WN(4)}  = COS(2*PI*4/32) = 0.707
   3      -I{WN(4)} = SIN(2*PI*4/32) = 0.707
   .
   .
   .
  12      R{WN(3)}  = COS(2*PI*3/32) = 0.831
  13      -I{WN(3)} = SIN(2*PI*3/32) = 0.556
  14      R{WN(7)}  = COS(2*PI*7/32) = 0.195
  15      -I{WN(7)} = SIN(2*PI*7/32) = 0.981
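
The ordering of this table can be reproduced with a short index calculation.
The C sketch below returns the exponent n of the twiddle factor WN(n) stored
at a given pair of table addresses, assuming (as the N = 32 example suggests)
that the table holds N/4 complex twiddle factors as cosine/sine pairs and
that pair number q holds the factor whose exponent is q bit-reversed over
log2(N/4) bits. The function name and this reading of the layout are
assumptions for illustration, not part of the FFT code itself.

```c
#include <assert.h>

/* Exponent n of WN(n) stored at table addresses 2*pair (cosine) and
 * 2*pair + 1 (sine) for an FFT of length fft_size. */
unsigned twiddle_exponent(unsigned pair, unsigned fft_size)
{
    assert(fft_size >= 8 && (fft_size & (fft_size - 1u)) == 0);
    unsigned pairs = fft_size / 4;   /* N/2 words = N/4 complex factors */
    assert(pair < pairs);
    unsigned n = 0;
    /* Bit-reverse `pair` over log2(pairs) bits. */
    for (unsigned bit = pairs >> 1; bit != 0; bit >>= 1) {
        if (pair & 1u) {
            n |= bit;
        }
        pair >>= 1;
    }
    return n;
}
```

For N = 32 this gives exponents 0, 4, 2, 6, 1, 5, 3, 7 for pairs 0 through 7,
which matches the table rows at addresses 0, 2, 12, and 14.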
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Example 6-15. Faster Version Complex Radix-2 DIT FFT 


******************************************************************************
*
*  FILENAME    : CR2DIT.ASM
*  DESCRIPTION : COMPLEX, RADIX-2 DIT FFT FOR TMS320C40
*  DATE        : 6/29/93
*  VERSION     : 4.0
*
******************************************************************************
*
*  VERSION  DATE      COMMENTS
*  -------  ----      --------
*  1.0                Original version
*                     RAIMUND MEYER, KARL SCHWARZ
*                     LEHRSTUHL FUER NACHRICHTENTECHNIK
*                     UNIVERSITAET ERLANGEN-NUERNBERG
*                     CAUERSTRASSE 7, D-8520 ERLANGEN, FRG
*  2.0                DANIEL CHEN (TI HOUSTON): C40 porting
*  3.0      7/1/92    ROSEMARIE PIEDRA (TI HOUSTON): made it
*                     C-callable and implemented changes in the order
*                     of the operands for some mpyf instructions for
*                     faster execution when sine table is off-chip
*  4.0      6/29/93   ROSEMARIE PIEDRA (TI Houston): Added support
*                     for in-place bit reversing.
*
******************************************************************************
*
*  SYNOPSIS: int cr2dit(SOURCE_ADDR, FFT_SIZE, DST_ADDR)
*                          ar2         r2        r3
*
*  float *SOURCE_ADDR  ; Points to where data is originated
*                      ; and operated on.
*  int   FFT_SIZE      ; 64, 128, 256, 512, 1024, ...
*  float *DST_ADDR     ; Points to where FFT results should be
*                      ; moved
*
******************************************************************************
*
*  THE COMPUTATION IS DONE IN-PLACE.
*
*  FOR THIS PROGRAM THE MINIMUM FFT LENGTH IS 32 POINTS BECAUSE OF THE
*  SEPARATE STAGES (THIS IS NOT CHECKED INSIDE THE PROGRAM).
*
*  THE FIRST TWO PASSES ARE REALIZED AS A FOUR BUTTERFLY LOOP SINCE THE
*  MULTIPLIES ARE TRIVIAL. THE MULTIPLIER IS ONLY USED FOR A LOAD IN
*  PARALLEL WITH AN ADDF OR SUBF.
*
******************************************************************************
*
*  SECTIONS NEEDED IN LINKER COMMAND FILE: .ffttxt : fft code
*                                          .fftdat : fft data
*
******************************************************************************
*
*  THE TWIDDLE FACTORS ARE STORED IN BIT-REVERSED ORDER AND WITH A TABLE LENGTH
*  OF N/2 (N = FFTLENGTH). THE SINE TABLE IS PROVIDED IN A SEPARATE FILE
*  WITH GLOBAL LABEL _SINE POINTING TO THE BEGINNING OF THE TABLE.
*


*  EXAMPLE: SHOWN FOR N=32, WN(n) = COS(2*PI*n/N) - j*SIN(2*PI*n/N)
*
*  ADDRESS   COEFFICIENT
*  0         R{WN(0)}  = COS(2*PI*0/32) = 1
*  1         -I{WN(0)} = SIN(2*PI*0/32) = 0
*  2         R{WN(4)}  = COS(2*PI*4/32) = 0.707
*  3         -I{WN(4)} = SIN(2*PI*4/32) = 0.707
*  .         .
*  12        R{WN(3)}  = COS(2*PI*3/32) = 0.831
*  13        -I{WN(3)} = SIN(2*PI*3/32) = 0.556
*  14        R{WN(7)}  = COS(2*PI*7/32) = 0.195
*  15        -I{WN(7)} = SIN(2*PI*7/32) = 0.981
*
*  WHEN GENERATED FOR AN FFT LENGTH OF 1024, THE TABLE CAN ALSO BE USED FOR
*  ALL FFT LENGTHS LESS THAN OR EQUAL TO 1024.
*
*  THE MISSING TWIDDLE FACTORS (WN(),WN(),....) ARE GENERATED BY USING
*  THE SYMMETRY WN(N/4+n) = -j*WN(n). THIS CAN BE REALIZED VERY EASILY BY
*  EXCHANGING THE REAL AND IMAGINARY PARTS OF THE TWIDDLE FACTORS AND BY
*  NEGATING THE NEW REAL PART.
*
******************************************************************************
*
*  AR + j AI ------------------------------ AR' + j AI'
*                        \   /
*                         \ /
*                          X
*                         / \
*                        /   \
*  BR + j BI ---- ( COS - j SIN ) --------- BR' + j BI'
*
*  TR  = BR * COS + BI * SIN
*  TI  = BI * COS - BR * SIN
*  AR' = AR + TR
*  AI' = AI + TI
*  BR' = AR - TR
*  BI' = AI - TI
*
******************************************************************************
        .global _cr2dit                 ; Entry execution point.
        .global _SINE                   ; sine table pointer
        .global STARTB,ENDB             ; starting/ending point for given
                                        ; benchmarks
        .sect   ".fftdat"
fg      .space  1                       ; is FFT_SIZE
fg2     .space  1                       ; is FFT_SIZE/2
fg4m2   .space  1                       ; is FFT_SIZE/4 - 2
fg8m2   .space  1                       ; is FFT_SIZE/8 - 2
sintab  .word   _SINE                   ; pointer to sine table
sintp2  .word   _SINE+2                 ; pointer to sine table +2
inputp2 .space  1                       ; pointer to input +2
inputp  .space  1                       ; pointer to source address
outputp .space  1                       ; pointer to dst address


*
*       Initialize C Function.
*
        .sect   ".ffttxt"
_cr2dit: LDI    SP,AR0
        PUSH    R4
        PUSH    R5
        PUSH    R6
        PUSHF   R6
        PUSH    R7
        PUSHF   R7
        PUSH    AR3
        PUSH    AR4
        PUSH    AR5
        PUSH    AR6
        PUSH    AR7
        PUSH    DP
        .if     .REGPARM == 0           ; arguments passed in stack
        LDI     *-AR0(1),AR2            ; src address
        LDI     *-AR0(2),R2             ; FFT size
        LDI     *-AR0(3),R3             ; dst address
        .endif
        LDP     fg                      ; Initialize DP pointer.
        STI     R2,@fg                  ; fg = FFT_SIZE
        LSH     -1,R2                   ; R2 = FFT_SIZE/2
        STI     AR2,@inputp             ; inputp = SOURCE_ADDR
        ADDI    2,AR2,R0
        STI     R0,@inputp2             ; inputp2 = SOURCE_ADDR + 2
        STI     R3,@outputp             ; outputp = DST_ADDR
        STI     R2,@fg2                 ; fg2 = nhalb = (FFT_SIZE/2)
        LSH     -1,R2
        SUBI    2,R2,R0
        STI     R0,@fg4m2               ; fg4m2 = NVIERT-2 : (FFT_SIZE/4)-2
        LSH     -1,R2
        SUBI    2,R2,R0
        STI     R0,@fg8m2
*       ar0 : AR + AI
*       ar1 : BR + BI
*       ar2 : CR + CI + CR' + CI'
*       ar3 : DR + DI
*       ar4 : AR' + AI'
*       ar5 : BR' + BI'
*       ar6 : DR' + DI'
*       ar7 : first twiddle factor = 1


STARTB:
        ldi     @fg2,ir0                ; ir0 = n/2 = offset between SOURCE_ADDRs
        ldi     @sintab,ar7             ; ar7 points to twiddle factor 1
        ldi     ar2,ar0                 ; ar0 points to AR
        addi    ir0,ar0,ar1             ; ar1 points to BR
        addi    ir0,ar1,ar2             ; ar2 points to CR
        addi    ir0,ar2,ar3             ; ar3 points to DR
        ldi     ar0,ar4                 ; ar4 points to AR'
        ldi     ar1,ar5                 ; ar5 points to BR'
        ldi     ar3,ar6                 ; ar6 points to DR'
        ldi     2,ir1                   ; addressoffset
        lsh     -1,ir0                  ; ir0 = n/4 = number of R4-butterflies
        subi    2,ir0,rc
******************************************************************************
*       ---------- FIRST 2 STAGES AS RADIX-4 BUTTERFLY ----------            *
******************************************************************************
*       fill pipeline
        addf    *ar2,*ar0,r4            ; r4 = AR + CR
        subf    *ar2,*ar0++,r5          ; r5 = AR - CR
        addf    *ar1,*ar3,r6            ; r6 = DR + BR
        subf    *ar1++,*ar3++,r7        ; r7 = DR - BR
        addf    r6,r4,r0                ; AR' = r0 = r4 + r6
        mpyf    *ar7,*ar3++,r1          ; r1 = DI , BR' = r3 = r4 - r6
||      subf    r6,r4,r3
        addf    r1,*ar1,r0              ; r0 = BI + DI , AR' = r0
||      stf     r0,*ar4++
        subf    r1,*ar1++,r1            ; r1 = BI - DI , BR' = r3
||      stf     r3,*ar5++
        addf    r1,r5,r2                ; CR' = r2 = r5 + r1
        mpyf    *ar7,*+ar2,r1           ; r1 = CI , DR' = r3 = r5 - r1
||      subf    r1,r5,r3
        rptbd   blk1                    ; Setup for radix-4 butterfly loop
        addf    r1,*ar0,r2              ; r2 = AI + CI , CR' = r2
||      stf     r2,*ar2++(ir1)
        subf    r1,*ar0++,r6            ; r6 = AI - CI , DR' = r3
||      stf     r3,*ar6++
        addf    r0,r2,r4                ; AI' = r4 = r2 + r0


*       radix-4 butterfly loop
*
        mpyf    *ar7,*ar2--,r0          ; r0 = CR , (BI' = r2 = r2 - r0)
||      subf    r0,r2,r2
        mpyf    *ar7,*ar1++,r1          ; r1 = BR , (CI' = r3 = r6 + r7)
||      addf    r7,r6,r3
        addf    r0,*ar0,r4              ; r4 = AR + CR , (AI' = r4)
||      stf     r4,*ar4++
        subf    r0,*ar0++,r5            ; r5 = AR - CR , (BI' = r2)
||      stf     r2,*ar5++
        subf    r7,r6,r7                ; (DI' = r7 = r6 - r7)
        addf    r1,*ar3,r6              ; r6 = DR + BR , (DI' = r7)
||      stf     r7,*ar6++
        subf    r1,*ar3++,r7            ; r7 = DR - BR , (CI' = r3)
||      stf     r3,*ar2++
        addf    r6,r4,r0                ; AR' = r0 = r4 + r6
        mpyf    *ar7,*ar3++,r1          ; r1 = DI , BR' = r3 = r4 - r6
||      subf    r6,r4,r3
        addf    r1,*ar1,r0              ; r0 = BI + DI , AR' = r0
||      stf     r0,*ar4++
        subf    r1,*ar1++,r1            ; r1 = BI - DI , BR' = r3
||      stf     r3,*ar5++
        addf    r1,r5,r2                ; CR' = r2 = r5 + r1
        mpyf    *ar7,*+ar2,r1           ; r1 = CI , DR' = r3 = r5 - r1
||      subf    r1,r5,r3
        addf    r1,*ar0,r2              ; r2 = AI + CI , CR' = r2
||      stf     r2,*ar2++(ir1)
        subf    r1,*ar0++,r6            ; r6 = AI - CI , DR' = r3
||      stf     r3,*ar6++
blk1    addf    r0,r2,r4                ; AI' = r4 = r2 + r0
*       clear pipeline
*
        subf    r0,r2,r2                ; BI' = r2 = r2 - r0
        addf    r7,r6,r3                ; CI' = r3 = r6 + r7
        stf     r4,*ar4                 ; AI' = r4
||      stf     r2,*ar5                 ; BI' = r2
        subf    r7,r6,r7                ; DI' = r7 = r6 - r7
        stf     r7,*ar6                 ; DI' = r7
||      stf     r3,*--ar2               ; CI' = r3
******************************************************************************
*       ---------- THIRD TO LAST-2 STAGE ----------                          *
******************************************************************************
        ldi     @fg2,ir1
        subi    1,ir0,ar5
        ldi     1,ar6
        ldi     @sintab,ar7             ; pointer to twiddle factor
        ldi     0,ar4                   ; group counter
        ldi     @inputp,ar0
stufe   ldi     ar0,ar2                 ; upper real butterfly output
        addi    ir0,ar0,ar3             ; lower real butterfly output
        ldi     ar3,ar1                 ; lower real butterfly input
        lsh     1,ar6                   ; double group count
        lsh     -2,ar5                  ; half butterfly count
        lsh     1,ar5                   ; clear LSB
        lsh     -1,ir0                  ; half step from upper to lower real part
        lsh     -1,ir1
        addi    1,ir1                   ; step from old imaginary to new
                                        ; real value
        ldf     *ar1++,r6               ; dummy load, only for address update
||      ldf     *ar7,r7                 ; r7 = COS
gruppe
*       fill pipeline
*
*       ar0 = upper real butterfly input
*       ar1 = lower real butterfly input
*       ar2 = upper real butterfly output
*       ar3 = lower real butterfly output
*       the imaginary part has to follow
        ldf     *++ar7,r6               ; r6 = SIN
        mpyf    *ar1--,r6,r1            ; r1 = BI * SIN
||      addf    *++ar4,r0,r3            ; dummy addf for counter update
        mpyf    *ar1,r7,r0              ; r0 = BR * COS
        ldi     ar5,rc
        rptbd   bfly1                   ; Setup for loop bfly1
        mpyf    *ar7--,*ar1++,r0        ; r3 = TR = r0 + r1 , r0 = BR * SIN
||      addf    r0,r1,r3
        mpyf    *ar1++,r7,r1            ; r1 = BI * COS , r2 = AR - TR
||      subf    r3,*ar0,r2
        addf    *ar0++,r3,r5            ; r5 = AR + TR , BR' = r2
||      stf     r2,*ar3++
*       FIRST BUTTERFLY-TYPE:
*
*       TR  = BR * COS + BI * SIN
*       TI  = BR * SIN - BI * COS
*       AR' = AR + TR
*       AI' = AI - TI
*       BR' = AR - TR
*       BI' = AI + TI
*       loop bfly1


        mpyf    *+ar1,r6,r5             ; r5 = BI * SIN , (AR' = r5)
||      stf     r5,*ar2++
        subf    r1,r0,r2                ; (r2 = TI = r0 - r1)
        mpyf    *ar1,r7,r0              ; r0 = BR * COS , (r3 = AI + TI)
||      addf    r2,*ar0,r3
        subf    r2,*ar0++,r4            ; (r4 = AI - TI , BI' = r3)
||      stf     r3,*ar3++
        addf    r0,r5,r3                ; r3 = TR = r0 + r5
        mpyf    *ar1++,r6,r0            ; r0 = BR * SIN , r2 = AR - TR
||      subf    r3,*ar0,r2
        mpyf    *ar1++,r7,r1            ; r1 = BI * COS , (AI' = r4)
||      stf     r4,*ar2++
bfly1   addf    *ar0++,r3,r5            ; r5 = AR + TR , BR' = r2
||      stf     r2,*ar3++
*       switch over to next group
        subf    r1,r0,r2                ; r2 = TI = r0 - r1
        addf    r2,*ar0,r3              ; r3 = AI + TI , AR' = r5
||      stf     r5,*ar2++
        subf    r2,*ar0++(ir1),r4       ; r4 = AI - TI , BI' = r3
||      stf     r3,*ar3++(ir1)
        nop     *ar1++(ir1)             ; address update
        mpyf    *ar1--,r7,r1            ; r1 = BI * COS , AI' = r4
||      stf     r4,*ar2++(ir1)
        mpyf    *ar1,r6,r0              ; r0 = BR * SIN
        ldi     ar5,rc
        rptbd   bfly2                   ; Setup for loop bfly2
        mpyf    *ar7++,*ar1++,r0        ; r3 = TR = r1 - r0 , r0 = BR * COS
||      subf    r0,r1,r3
        mpyf    *ar1++,r6,r1            ; r1 = BI * SIN , r2 = AR - TR
||      subf    r3,*ar0,r2
        addf    *ar0++,r3,r5            ; r5 = AR + TR , BR' = r2
||      stf     r2,*ar3++
*       SECOND BUTTERFLY-TYPE:
*
*       TR  = BI * COS - BR * SIN
*       TI  = BI * SIN + BR * COS
*       AR' = AR + TR
*       AI' = AI - TI
*       BR' = AR - TR
*       BI' = AI + TI
*       loop bfly2
        mpyf    *+ar1,r7,r5             ; r5 = BI * COS , (AR' = r5)
||      stf     r5,*ar2++
        addf    r1,r0,r2                ; (r2 = TI = r0 + r1)
        mpyf    *ar1,r6,r0              ; r0 = BR * SIN , (r3 = AI + TI)
||      addf    r2,*ar0,r3
        subf    r2,*ar0++,r4            ; (r4 = AI - TI , BI' = r3)
||      stf     r3,*ar3++
        subf    r0,r5,r3                ; r3 = TR = r5 - r0
        mpyf    *ar1++,r7,r0            ; r0 = BR * COS , r2 = AR - TR
||      subf    r3,*ar0,r2
        mpyf    *ar1++,r6,r1            ; r1 = BI * SIN , (AI' = r4)
||      stf     r4,*ar2++
bfly2   addf    *ar0++,r3,r5            ; r5 = AR + TR , BR' = r2
||      stf     r2,*ar3++
*       clear pipeline
        addf    r1,r0,r2                ; r2 = TI = r0 + r1
        addf    r2,*ar0,r3              ; r3 = AI + TI , AR' = r5
||      stf     r5,*ar2++
        cmpi    ar6,ar4
        bned    gruppe                  ; do following 3 instructions
        subf    r2,*ar0++(ir1),r4       ; r4 = AI - TI , BI' = r3
||      stf     r3,*ar3++(ir1)
        ldf     *++ar7,r7               ; r7 = COS
||      stf     r4,*ar2++(ir1)          ; AI' = r4
        nop     *ar1++(ir1)             ; branch here
*       end of this butterflygroup
        cmpi    4,ir0                   ; jump out after ld(n)-3 stage
        bnzaf   stufe
        ldi     @sintab,ar7             ; pointer to twiddle factor
        ldi     0,ar4                   ; group counter
        ldi     @inputp,ar0
******************************************************************************
*       ---------- SECOND LAST STAGE ----------                              *
******************************************************************************
        ldi     @inputp,ar0
        ldi     ar0,ar2                 ; upper output
        addi    ir0,ar0,ar1             ; lower input
        ldi     ar1,ar3                 ; lower output
        ldi     @sintp2,ar7             ; pointer to twiddle factor
        ldi     5,ir0                   ; distance between two groups
        ldi     @fg8m2,rc
*       fill pipeline
*       1. butterfly: w^0
        addf    *ar0,*ar1,r2            ; AR' = r2 = AR + BR
        subf    *ar1++,*ar0++,r3        ; BR' = r3 = AR - BR
        addf    *ar0,*ar1,r0            ; AI' = r0 = AI + BI
        subf    *ar1++,*ar0++,r1        ; BI' = r1 = AI - BI
*       2. butterfly: w^0
        addf    *ar0,*ar1,r6            ; AR' = r6 = AR + BR
        subf    *ar1++,*ar0++,r7        ; BR' = r7 = AR - BR
        addf    *ar0,*ar1,r4            ; AI' = r4 = AI + BI
        subf    *ar1++(ir0),*ar0++(ir0),r5 ; BI' = r5 = AI - BI
        stf     r2,*ar2++               ; (AR' = r2)
||      stf     r3,*ar3++               ; (BR' = r3)
        stf     r0,*ar2++               ; (AI' = r0)
||      stf     r1,*ar3++               ; (BI' = r1)
        stf     r6,*ar2++               ; AR' = r6
||      stf     r7,*ar3++               ; BR' = r7
        stf     r4,*ar2++(ir0)          ; AI' = r4
||      stf     r5,*ar3++(ir0)          ; BI' = r5
*       3. butterfly: w^M/4
        addf    *ar0++,*+ar1,r5         ; AR' = r5 = AR + BI
        subf    *ar1,*ar0,r4            ; AI' = r4 = AI - BR
        addf    *ar1++,*ar0--,r6        ; BI' = r6 = AI + BR
        subf    *ar1++,*ar0++,r7        ; BR' = r7 = AR - BI
*       4. butterfly: w^M/4
        addf    *+ar1,*++ar0,r3         ; AR' = r3 = AR + BI
        ldf     *-ar7,r1                ; r1 = 0 (for inner loop)
||      ldf     *ar1++,r0               ; r0 = BR (for inner loop)
        rptbd   bf2end                  ; Setup for loop bf2end
        subf    *ar1++(ir0),*ar0++,r2   ; BR' = r2 = AR - BI
        stf     r5,*ar2++               ; (AR' = r5)
||      stf     r7,*ar3++               ; (BR' = r7)
        stf     r6,*ar3++               ; (BI' = r6)
*       5. to M. butterfly:
*       loop bf2end
        ldf     *ar7++,r7               ; r7 = COS , ((AI' = r4))
||      stf     r4,*ar2++
        ldf     *ar7++,r6               ; r6 = SIN , (BR' = r2)
||      stf     r2,*ar3++
        mpyf    *+ar1,r6,r5             ; r5 = BI * SIN , (AR' = r3)
||      stf     r3,*ar2++
        addf    r1,r0,r2                ; (r2 = TI = r0 + r1)
        mpyf    *ar1,r7,r0              ; r0 = BR * COS , (r3 = AI + TI)
||      addf    r2,*ar0,r3
        subf    r2,*ar0++(ir0),r4       ; (r4 = AI - TI , BI' = r3)
||      stf     r3,*ar3++(ir0)
        addf    r0,r5,r3                ; r3 = TR = r0 + r5
        mpyf    *ar1++,r6,r0            ; r0 = BR * SIN , r2 = AR - TR
||      subf    r3,*ar0,r2
        mpyf    *ar1++,r7,r1            ; r1 = BI * COS , (AI' = r4)
||      stf     r4,*ar2++(ir0)
        addf    *ar0++,r3,r5            ; r5 = AR + TR , BR' = r2
||      stf     r2,*ar3++
        mpyf    *+ar1,r6,r5             ; r5 = BI * SIN , (AR' = r5)
||      stf     r5,*ar2++
        subf    r1,r0,r2                ; (r2 = TI = r0 - r1)
        mpyf    *ar1,r7,r0              ; r0 = BR * COS , (r3 = AI + TI)
||      addf    r2,*ar0,r3
        subf    r2,*ar0++,r4            ; (r4 = AI - TI , BI' = r3)
||      stf     r3,*ar3++
        addf    r0,r5,r3                ; r3 = TR = r0 + r5
        mpyf    *ar1++,r6,r0            ; r0 = BR * SIN , r2 = AR - TR
||      subf    r3,*ar0,r2
        mpyf    *ar1++(ir0),r7,r1       ; r1 = BI * COS , (AI' = r4)
||      stf     r4,*ar2++
        addf    *ar0++,r3,r3            ; r3 = AR + TR , BR' = r2
||      stf     r2,*ar3++


        mpyf    *+ar1,r7,r5             ; r5 = BI * COS , (AR' = r3)
||      stf     r3,*ar2++
        subf    r1,r0,r2                ; (r2 = TI = r0 - r1)
        mpyf    *ar1,r6,r0              ; r0 = BR * SIN , (r3 = AI + TI)
||      addf    r2,*ar0,r3
        subf    r2,*ar0++(ir0),r4       ; (r4 = AI - TI , BI' = r3)
||      stf     r3,*ar3++(ir0)
        subf    r0,r5,r3                ; r3 = TR = r5 - r0
        mpyf    *ar1++,r7,r0            ; r0 = BR * COS , r2 = AR - TR
||      subf    r3,*ar0,r2
        mpyf    *ar1++,r6,r1            ; r1 = BI * SIN , (AI' = r4)
||      stf     r4,*ar2++(ir0)
        addf    *ar0++,r3,r5            ; r5 = AR + TR , BR' = r2
||      stf     r2,*ar3++
        mpyf    *+ar1,r7,r5             ; r5 = BI * COS , (AR' = r5)
||      stf     r5,*ar2++
        addf    r1,r0,r2                ; (r2 = TI = r0 + r1)
        mpyf    *ar1,r6,r0              ; r0 = BR * SIN , (r3 = AI + TI)
||      addf    r2,*ar0,r3
        subf    r2,*ar0++,r4            ; (r4 = AI - TI , BI' = r3)
||      stf     r3,*ar3++
        subf    r0,r5,r3                ; r3 = TR = r5 - r0
        mpyf    *ar1++,r7,r0            ; r0 = BR * COS , r2 = AR - TR
||      subf    r3,*ar0,r2
bf2end  mpyf    *ar1++(ir0),r6,r1       ; r1 = BI * SIN , r3 = AR + TR
||      addf    *ar0++,r3,r3
*       clear pipeline
        stf     r2,*ar3++               ; BR' = r2 , AI' = r4
||      stf     r4,*ar2++
        addf    r1,r0,r2                ; r2 = TI = r0 + r1
        addf    r2,*ar0,r3              ; r3 = AI + TI , AR' = r3
||      stf     r3,*ar2++
        subf    r2,*ar0,r4              ; r4 = AI - TI , BI' = r3
||      stf     r3,*ar3
        stf     r4,*ar2                 ; AI' = r4
******************************************************************************
*       ---------- LAST STAGE ----------                                     *
******************************************************************************
        ldi     @inputp,ar0
        ldi     ar0,ar2                 ; upper output
        ldi     @inputp2,ar1
        ldi     ar1,ar3                 ; lower output
        ldi     @sintp2,ar7             ; pointer to twiddle factors
        ldi     3,ir0                   ; group offset
        ldi     @fg4m2,rc
*       fill pipeline


*       1. butterfly: w^0
        addf    *ar0,*ar1,r6            ; AR' = r6 = AR + BR
        subf    *ar1++,*ar0++,r7        ; BR' = r7 = AR - BR
        addf    *ar0,*ar1,r4            ; AI' = r4 = AI + BI
        subf    *ar1++(ir0),*ar0++(ir0),r5 ; BI' = r5 = AI - BI
*       2. butterfly: w^M/4
        addf    *+ar1,*ar0,r3           ; AR' = r3 = AR + BI
        ldf     *-ar7,r1                ; r1 = 0 (for inner loop)
||      ldf     *ar1++,r0               ; r0 = BR (for inner loop)
        rptbd   bflend                  ; Setup for loop bflend
        subf    *ar1++(ir0),*ar0++,r2   ; BR' = r2 = AR - BI
        stf     r6,*ar2++               ; (AR' = r6)
||      stf     r7,*ar3++               ; (BR' = r7)
        stf     r5,*ar3++(ir0)          ; (BI' = r5)
*       3. to M. butterfly:
*       loop bflend
        ldf     *ar7++,r7               ; r7 = COS , ((AI' = r4))
||      stf     r4,*ar2++(ir0)
        ldf     *ar7++,r6               ; r6 = SIN , (BR' = r2)
||      stf     r2,*ar3++
        mpyf    *+ar1,r6,r5             ; r5 = BI * SIN , (AR' = r3)
||      stf     r3,*ar2++
        addf    r1,r0,r2                ; (r2 = TI = r0 + r1)
        mpyf    *ar1,r7,r0              ; r0 = BR * COS , (r3 = AI + TI)
||      addf    r2,*ar0,r3
        subf    r2,*ar0++(ir0),r4       ; (r4 = AI - TI , BI' = r3)
||      stf     r3,*ar3++(ir0)
        addf    r0,r5,r3                ; r3 = TR = r0 + r5
        mpyf    *ar1++,r6,r0            ; r0 = BR * SIN , r2 = AR - TR
||      subf    r3,*ar0,r2
        mpyf    *ar1++(ir0),r7,r1       ; r1 = BI * COS , (AI' = r4)
||      stf     r4,*ar2++(ir0)
        addf    *ar0++,r3,r3            ; r3 = AR + TR , BR' = r2
||      stf     r2,*ar3++
        mpyf    *+ar1,r7,r5             ; r5 = BI * COS , (AR' = r3)
||      stf     r3,*ar2++
        subf    r1,r0,r2                ; (r2 = TI = r0 - r1)
        mpyf    *ar1,r6,r0              ; r0 = BR * SIN , (r3 = AI + TI)
||      addf    r2,*ar0,r3
        subf    r2,*ar0++(ir0),r4       ; (r4 = AI - TI , BI' = r3)
||      stf     r3,*ar3++(ir0)
        subf    r0,r5,r3                ; r3 = TR = r5 - r0
        mpyf    *ar1++,r7,r0            ; r0 = BR * COS , r2 = AR - TR
||      subf    r3,*ar0,r2
bflend  mpyf    *ar1++(ir0),r6,r1       ; r1 = BI * SIN , r3 = AR + TR
||      addf    *ar0++,r3,r3
*       clear pipeline
        stf     r2,*ar3++               ; BR' = r2 , (AI' = r4)
||      stf     r4,*ar2++(ir0)
        addf    r1,r0,r2                ; r2 = TI = r0 + r1
        addf    r2,*ar0,r3              ; r3 = AI + TI , AR' = r3
||      stf     r3,*ar2++
        subf    r2,*ar0,r4              ; r4 = AI - TI , BI' = r3
||      stf     r3,*ar3
        stf     r4,*ar2                 ; AI' = r4
ENDB:
******************************************************************************
*       ---------- END OF FFT ----------                                     *
******************************************************************************
*       ---------- BITREVERSAL ----------                                    *
*  This bit-reversal section assumes input and output in Re-Im-Re-Im format  *
******************************************************************************


ldi 
cmpi 
beqd 
ldi 
ldi 
subi 
rpthbd 
ldi 


QO 
Fh Fh GQ. FH FH FH FO MB 


3) 
{e) 
HO 


a 
ct 


’ 


@inputp, ar0 
@outputp, ar0d 
INPLACE 
@outputp,arl 
@fg,ird 
2,14r0,re 
bitrvl 

Zp cel 
*tar0(1),r0 


lj earl 


; Return to C environment. 


i 

INPLACE 
rptbd BITRV2 
nop *++arl1 (2) 
nop *ar0++(ir0)b 
nop 


j; ar1=DSR_ADD 
; irO=FFT_SIZI 
;CC=FFT_SIZE 


| Ew 


2 


jirl=2 
;rvead first Im value 
        cmpi    ar1,ar0
        bgeat   CONT
        ldf     *ar1,r0
||      ldf     *ar0,r1
        stf     r0,*ar0
||      stf     r1,*ar1
        ldf     *+ar1(1),r0
||      ldf     *+ar0(1),r1
        stf     r0,*+ar0(1)
||      stf     r1,*+ar1(1)
CONT    nop     *++ar1(2)
BITRV2  nop     *ar0++(ir0)b
; Return to C environment
end:    POP     DP
        POP     AR7
        POP     AR6
        POP     AR5
        POP     AR4
        POP     AR3
        POPF    R7
        POP     R7
        POPF    R6
        POP     R6
        POP     R5
        POP     R4
        RETS
        .end
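
The in-place branch of the bit-reversal code above exchanges two complex points only when a pointer comparison allows it, so that each pair is swapped once. A minimal host-side sketch of that idea in Python follows. It is an illustration, not a translation of the assembly: the function names are hypothetical, and the comparison is done on indices (swap when i < j) rather than on the address registers.

```python
def reverse_bits(value, bits):
    """Reverse the lowest `bits` bits of `value`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def bit_reverse_in_place(data):
    """Reorder interleaved [Re, Im, Re, Im, ...] data into bit-reversed order.

    The number of complex points is assumed to be a power of two.
    """
    points = len(data) // 2
    bits = points.bit_length() - 1
    for i in range(points):
        j = reverse_bits(i, bits)
        if i < j:                        # swap each pair exactly once
            data[2 * i], data[2 * j] = data[2 * j], data[2 * i]                   # Re
            data[2 * i + 1], data[2 * j + 1] = data[2 * j + 1], data[2 * i + 1]   # Im
    return data


if __name__ == "__main__":
    samples = [0, 0, 1, 10, 2, 20, 3, 30, 4, 40, 5, 50, 6, 60, 7, 70]
    print(bit_reverse_in_place(samples))
```

Applying the function twice restores the original order, which is a convenient check when comparing against the DSP output.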
******************************************************************************
*
*  SINTAB.ASM : Bit-reversed sine table for a 64-point FFT
*               File to be linked with the source code for a
*               64-point radix-2 DIT FFT
*               Sine table length = FFT size / 2
*
******************************************************************************
        .global _SINE
        .sect   ".sintab"
_SINE
        .float  1.000000
        .float  0.000000
        .float  0.707107
        .float  0.707107
        .float  0.923880
        .float  0.382683
        .float  0.382683
        .float  0.923880
        .float  0.980785
        .float  0.195090
        .float  0.555570
        .float  0.831470
        .float  0.831470
        .float  0.555570
        .float  0.195090
        .float  0.980785
        .float  0.995185
        .float  0.098017
        .float  0.634393
        .float  0.773010
        .float  0.881921
        .float  0.471397
        .float  0.290285
        .float  0.956940
        .float  0.956940
        .float  0.290285
        .float  0.471397
        .float  0.881921
        .float  0.773010
        .float  0.634393
        .float  0.098017
        .float  0.995185
        .end
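
The table holds only the twiddle factors WN(n) for n < N/4. The header comment of Example 6-15 states that the remaining factors follow from the symmetry WN(N/4+n) = -j*WN(n), realized by exchanging the two stored words and negating the new real part. The sketch below checks that rule numerically in Python. The helper names are hypothetical, and the stored pair is assumed to be (R{WN(n)}, -I{WN(n)}) = (cos, sin), as in the table above.

```python
import cmath
import math


def stored_pair(fft_size, n):
    """Pair kept in the sine table for WN(n): (R{WN(n)}, -I{WN(n)})."""
    angle = 2.0 * math.pi * n / fft_size
    return (math.cos(angle), math.sin(angle))


def derived_pair(pair):
    """Pair for WN(N/4 + n): exchange the two words, negate the new real part."""
    cos_value, sin_value = pair
    return (-sin_value, cos_value)


def twiddle(fft_size, n):
    """Reference value WN(n) = exp(-j*2*pi*n/N)."""
    return cmath.exp(-2j * math.pi * n / fft_size)


if __name__ == "__main__":
    real, neg_imag = derived_pair(stored_pair(32, 3))
    reference = twiddle(32, 3 + 32 // 4)
    print(real, neg_imag, reference)
```

This is also why the listing alternates between two butterfly types: the second type uses the same COS and SIN words as the first, with their roles exchanged.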
6.5.4 Real Radix-2 FFT

Most often, the data to be transformed is a sequence of real numbers. In this
case, the FFT exhibits certain symmetries that permit the reduction of
the computational load even further. Example 6-17 and Example 6-18 show
the generic implementation of a real-valued radix-2 FFT (forward and inverse).
For such an FFT, the total storage required for a length-N transform is only N
locations; in a complex FFT, 2N are necessary. Recovery of the rest of the
points is based on the symmetry conditions.


(Example 6-13) should be used to provide the twiddle factors. 
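
The storage saving can be illustrated on a host machine. The sketch below, in Python, packs the spectrum of a real sequence into N words using the layout documented in the header of Example 6-17 (R(0) ... R(N/2) followed by I(N/2-1) ... I(1)) and then recovers all N complex points from conjugate symmetry. It is an illustration only: the function names are hypothetical, a direct DFT stands in for the FFT, and the sign convention exp(-j*2*pi*k*t/N) is an assumption that should be checked against the DSP output.

```python
import cmath
import math


def dft(samples):
    """Direct DFT, used here only as a reference transform."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def pack_real_spectrum(spectrum):
    """Store R(0)..R(N/2), then I(N/2-1)..I(1), in N words."""
    n = len(spectrum)
    half = n // 2
    reals = [spectrum[k].real for k in range(half + 1)]
    imags = [spectrum[k].imag for k in range(half - 1, 0, -1)]
    return reals + imags


def unpack_real_spectrum(packed):
    """Rebuild all N complex points using X(N-k) = conj(X(k))."""
    n = len(packed)
    half = n // 2
    spectrum = [0j] * n
    spectrum[0] = complex(packed[0], 0.0)
    spectrum[half] = complex(packed[half], 0.0)
    for k in range(1, half):
        value = complex(packed[k], packed[n - k])   # I(k) sits at index N-k
        spectrum[k] = value
        spectrum[n - k] = value.conjugate()
    return spectrum


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0, 0.0, -1.0, -2.0, -3.0]
    packed = pack_real_spectrum(dft(x))
    print(len(packed), packed)
```

The packed array has exactly N entries because R(0) and R(N/2) have no imaginary part for real input, so the N/2 - 1 imaginary values fill the remaining locations.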


Example 6-17. Real Forward Radix-2 FFT

******************************************************************************
*
*  FILENAME    : FFFT_RL.ASM
*  DESCRIPTION : REAL, RADIX-2 DIF FFT FOR TMS320C40
*  DATE        : 1/19/93
*  VERSION     : 3.0
*
******************************************************************************
*
*  VERSION  DATE      COMMENTS
*  -------  ----      --------
*  1.0      7/18/91   ALEX TESSAROLO (TI Australia):
*                     Original Release (C30 version)
*  2.0      7/23/92   ALEX TESSAROLO (TI Australia):
*                     Most Stages Modified (C30 version).
*                     Minimum FFT Size increased from 32 to 64.
*                     Faster in place bit reversing algorithm.
*                     Program size increased by about 100 words.
*                     One extra data word required.
*  3.0      1/19/93   ROSEMARIE PIEDRA (TI Houston):
*                     C40 porting started from C30 forward real FFT
*                     version 2.0. Expanded calling conventions to the use
*                     of registers for parameter passing.
*
******************************************************************************
*
*  SYNOPSIS:
*  int ffft_rl(FFT_SIZE, LOG_SIZE, SOURCE_ADDR, DEST_ADDR, SINE_TABLE, BIT_REVERSE)
*                ar2       r2         r3           rc         rs          re
*
*  int   FFT_SIZE      ; 64, 128, 256, 512, 1024, ...
*  int   LOG_SIZE      ; 6, 7, 8, 9, 10, ...
*  float *SOURCE_ADDR  ; Points to location of source data.
*  float *DEST_ADDR    ; Points to where data will be
*                      ; operated on and stored.
*  float *SINE_TABLE   ; Points to the SIN/COS table.
*  int   BIT_REVERSE   ; = 0, Bit Reversing is disabled.
*                      ; <> 0, Bit Reversing is enabled.
*
*  NOTE: 1) If SOURCE_ADDR = DEST_ADDR, then in place bit reversing
*           is performed, if enabled (more processor intensive).
*        2) FFT_SIZE must be >= 64 (this is not checked).


*
******************************************************************************
*
*  DESCRIPTION:
*
*  Generic function to do a radix-2 FFT computation on the C40.
*  The input data array is FFT_SIZE-long with only real data. The output is
*  stored in the same locations (in-place) with real and imaginary
*  points R and I as follows:
*
*     DEST_ADDR[0]            -> R(0)
*                                R(1)
*                                R(2)
*                                R(3)
*                                .
*                                R(FFT_SIZE/2)
*                                I(FFT_SIZE/2 - 1)
*                                .
*                                I(2)
*     DEST_ADDR[FFT_SIZE - 1] -> I(1)
*
*  The program is based on the FORTRAN program in the paper by Sorensen et al.,
*  June 1987 issue of Trans. on ASSP.
*
*  Bit reversal is optionally implemented at the beginning of the function.
*
*  The sine/cosine table for the twiddle factors is expected to be supplied in
*  the following format:
*
*     SINE_TABLE[0]            -> sin(0*2*pi/FFT_SIZE)
*                                 sin(1*2*pi/FFT_SIZE)
*                                 .
*                                 sin((FFT_SIZE/2-2)*2*pi/FFT_SIZE)
*     SINE_TABLE[FFT_SIZE/2-1] -> sin((FFT_SIZE/2-1)*2*pi/FFT_SIZE)
*
*  NOTE: The table is the first half period of a sine wave.
*
*  NOTES: 1. Calling C program can be compiled with large or small model. Both
*            calling conventions methods: stack or register for parameter
*            passing are supported.
*
*         2. Sections needed in linker command file: .ffttxt : fft code
*                                                     .fftdat : fft data
*
*         3. The DEST_ADDR must be aligned such that the first LOG_SIZE bits
*            are zero (this is not checked by the program)
*
*  Caution: DP initialized only once in the program. Be wary with interrupt
*           service routines. Make sure interrupt service routines save the DP
*           pointer.


*
******************************************************************************
*
*  REGISTERS USED: R0, R1, R2, R3, R4, R5, R6, R7
*                  AR0, AR1, AR2, AR3, AR4, AR5, AR6, AR7
*                  IR0, IR1
*                  RC, RS, RE
*                  DP
*
*  MEMORY REQUIREMENTS: Program = 405 Words (approximately)
*                       Data    = 7 Words
*                       Stack   = 12 Words
*
******************************************************************************
*
*  BENCHMARKS: Assumptions - Program in RAM0
*                          - Reserved data in RAM0
*                          - Stack on Local/Global Bus RAM
*                          - Sine/Cosine tables in RAM0
*                          - Processing and data destination in RAM1.
*                          - Local/Global Bus RAM, 0 wait state.
*
*  FFT Size   Bit Reversing   Data Source   Cycles (C40)
*  --------   -------------   -----------   --------------
*  1024       OFF             RAM1          19404 approx.
*
*  Note: This number does not include the C callable overheads.
*        This benchmark is the number of cycles between labels STARTB and ENDB.
*
*  NOTE:
*  - If .ffttxt is located off-chip, enable cache for faster performance
*
******************************************************************************
*
FP      .set    AR3
        .global _ffft_rl                ; Entry execution point.
        .global STARTB,ENDB
FFT_SIZE:       .usect  ".fftdat",1     ; Reserve memory for arguments.
LOG_SIZE:       .usect  ".fftdat",1
SOURCE_ADDR:    .usect  ".fftdat",1
DEST_ADDR:      .usect  ".fftdat",1
SINE_TABLE:     .usect  ".fftdat",1
BIT_REVERSE:    .usect  ".fftdat",1
SEPARATION:     .usect  ".fftdat",1
*
*  Initialize C Function
*
        .sect   ".ffttxt"
_ffft_rl: PUSH  FP                      ; Preserve C environment.


        LDI     SP,FP
        PUSH    R4
        PUSH    R5
        PUSH    R6
        PUSHF   R6
        PUSH    R7
        PUSHF   R7
        PUSH    AR4
        PUSH    AR5
        PUSH    AR6
        PUSH    AR7
        PUSH    DP
        LDP     FFT_SIZE                ; Initialize DP pointer.
        .if     .REGPARM == 0           ; arguments passed in stack
        LDA     *-FP(2),AR2
        LDI     *-FP(3),R2
        LDI     *-FP(4),R3
        LDI     *-FP(5),RC
        LDI     *-FP(6),RS
        LDI     *-FP(7),RE
        .endif
        STI     AR2,@FFT_SIZE
        STI     R2,@LOG_SIZE
        STI     R3,@SOURCE_ADDR
        STI     RC,@DEST_ADDR
        STI     RS,@SINE_TABLE
        STI     RE,@BIT_REVERSE
;
; Check Bit Reversing Mode (on or off).
;
; BIT_REVERSING = 0, then OFF (no bit reversing).
; BIT_REVERSING <> 0, Then ON.
;
        LDI     @BIT_REVERSE,R0
        BZ      MOVE_DATA
;
; Check Bit Reversing Type.
;
; If SourceAddr = DestAddr, Then In Place Bit Reversing.
; If SourceAddr <> DestAddr, Then Standard Bit Reversing.
;
        LDI     @SOURCE_ADDR,R0
        CMPI    @DEST_ADDR,R0
        BEQ     IN_PLACE
;
; Bit reversing Type (From Source to Destination).
;
; NOTE: abs(SOURCE_ADDR - DEST_ADDR) must be > FFT_SIZE, this is not checked.
;


        LDI     @FFT_SIZE,R0
        SUBI    2,R0
        LDA     @FFT_SIZE,IR0
        LSH     -1,IR0                  ; IR0 = Half FFT size.
        LDA     @SOURCE_ADDR,AR0
        LDA     @DEST_ADDR,AR1
        LDF     *AR0++,R1
        RPTS    R0
        LDF     *AR0++,R1
||      STF     R1,*AR1++(IR0)B
        STF     R1,*AR1++(IR0)B
        BR      STARTB
;
; In Place Bit Reversing.
;
; Bit Reversing On Even Locations, 1st Half Only.
;
IN_PLACE: LDA   @FFT_SIZE,IR0
        LSH     -2,IR0                  ; IR0 = Quarter FFT size.
        LDA     2,IR1
        LDI     @FFT_SIZE,RC
        LSH     -2,RC
        SUBI    3,RC
        LDA     @DEST_ADDR,AR0
        LDA     AR0,AR1
        LDA     AR0,AR2
        NOP     *AR1++(IR0)B
        NOP     *AR2++(IR0)B
        LDF     *++AR0(IR1),R0
        LDF     *AR1,R1
        RPTBD   BITRV1
        CMPI    AR1,AR0                 ; Xchange Locations only if AR0<AR1.
        LDFGT   R0,R1
        LDFGT   *AR1++(IR0)B,R1
        LDF     *++AR0(IR1),R0
||      STF     R0,*AR0
        LDF     *AR1,R1
||      STF     R1,*AR2++(IR0)B
        CMPI    AR1,AR0
        LDFGT   R0,R1
BITRV1: LDFGT   *AR1++(IR0)B,R0
        STF     R0,*AR0
        STF     R1,*AR2
;
; Perform Bit Reversing, Odd Locations, 2nd Half Only
;


        LDI     @FFT_SIZE,RC
        LSH     -1,RC
        LDA     @DEST_ADDR,AR0
        ADDI    RC,AR0
        ADDI    1,AR0
        LDA     AR0,AR1
        LDA     AR0,AR2
        LSH     -1,RC
        SUBI    3,RC
        NOP     *AR1++(IR0)B
        NOP     *AR2++(IR0)B
        LDF     *++AR0(IR1),R0
        LDF     *AR1,R1
        RPTBD   BITRV2
        CMPI    AR1,AR0                 ; Xchange Locations only if AR0<AR1
        LDFGT   R0,R1
        LDFGT   *AR1++(IR0)B,R1
        LDF     *++AR0(IR1),R0
||      STF     R0,*AR0
        LDF     *AR1,R1
||      STF     R1,*AR2++(IR0)B
        CMPI    AR1,AR0
        LDFGT   R0,R1
BITRV2: LDFGT   *AR1++(IR0)B,R0
        STF     R0,*AR0
        STF     R1,*AR2
;
; Perform Bit Reversing, Odd Locations, 1st Half Only
;
        LDI     @FFT_SIZE,RC
        LSH     -1,RC
        LDA     RC,IR0
        LDA     @DEST_ADDR,AR0
        LDA     AR0,AR1
        ADDI    1,AR0
        ADDI    IR0,AR1
        LSH     -1,RC
        LDA     RC,IR0
        SUBI    2,RC
        RPTBD   BITRV3
        NOP                             ; Note: could be instruction
        LDF     *AR0,R0
        LDF     *AR1,R1
        LDF     *++AR0(IR1),R0
||      STF     R0,*AR1++(IR0)B
BITRV3: LDF     *AR1,R1
||      STF     R1,*-AR0(IR1)
        STF     R0,*AR1
||      STF     R1,*AR0
        BR      STARTB


;
; Check Data Source Locations.
;
; If SourceAddr = DestAddr, Then do nothing.
; If SourceAddr <> DestAddr, Then move data.
;
MOVE_DATA: LDI  @SOURCE_ADDR,R0
        CMPI    @DEST_ADDR,R0
        BEQ     STARTB
        LDI     @FFT_SIZE,R0
        SUBI    2,R0
        LDA     @SOURCE_ADDR,AR0
        LDA     @DEST_ADDR,AR1
        LDF     *AR0++,R1
        RPTS    R0
        LDF     *AR0++,R1
||      STF     R1,*AR1++
        STF     R1,*AR1
;
; Perform first and second FFT loops.
;
;   AR1 -> |__I1__|  0 <-  [X(I1) + X(I2)] + [X(I3) + X(I4)]
;   AR2 -> |__I2__|  1 <-  [X(I1) - X(I2)]
;   AR3 -> |__I3__|  2 <-  [X(I1) + X(I2)] - [X(I3) + X(I4)]
;   AR4 -> |__I4__|  3 <- -[X(I3) - X(I4)]
;   AR1 -> |      |  4
;          |      |
;             .
;            \|/
;
STARTB: LDA     @DEST_ADDR,AR1
        LDA     AR1,AR2
        LDA     AR1,AR3
        LDA     AR1,AR4
        ADDI    1,AR2
        ADDI    2,AR3
        ADDI    3,AR4
        LDA     4,IR0
        LDI     @FFT_SIZE,RC
        LSH     -2,RC
        SUBI    2,RC
        LDF     *AR2,R0                 ; R0 = X(I2)
||      LDF     *AR3,R1                 ; R1 = X(I3)
        ADDF3   R1,*AR4,R4              ; R4 = X(I3) + X(I4)
        SUBF3   R1,*AR4++(IR0),R5       ; R5 = -[X(I3) - X(I4)]
        SUBF3   R0,*AR1,R6              ; R6 = X(I1) - X(I2)


        RPTBD   LOOP1_2
        ADDF3   R0,*AR1++(IR0),R7       ; R7 = X(I1) + X(I2)
        ADDF3   R7,R4,R2                ; R2 = R7 + R4
        SUBF3   R4,R7,R3                ; R3 = R7 - R4
        LDF     *+AR2(IR0),R0
||      LDF     *+AR3(IR0),R1
        ADDF3   R1,*AR4,R4
||      STF     R3,*AR3++(IR0)          ; X(I3) <- R3
        SUBF3   R1,*AR4++(IR0),R5
||      STF     R5,*-AR4(IR0)           ; X(I4) <- R5
        SUBF3   R0,*AR1,R6
||      STF     R6,*AR2++(IR0)          ; X(I2) <- R6
        ADDF3   R0,*AR1++(IR0),R7
||      STF     R2,*-AR1(IR0)           ; X(I1) <- R2
        ADDF3   R7,R4,R2
LOOP1_2: SUBF3  R4,R7,R3
        STF     R3,*AR3
||      STF     R5,*-AR4(IR0)
        STF     R6,*AR2
||      STF     R2,*-AR1(IR0)
;
; Perform Third FFT Loop.
;
; Part A:
;
;   AR1 -> |__I1__|  0 <-  X(I1) + X(I3)
;          |      |  1
;          |      |  2
;          |      |  3
;   AR2 -> |__I3__|  4 <-  X(I1) - X(I3)
;          |      |  5
;   AR3 -> |__I4__|  6 <- -X(I4)
;          |      |  7
;   AR1 ->           8
;                    9
;             .
;            \|/
;
        LDA     @DEST_ADDR,AR1
        LDA     AR1,AR2
        LDA     AR1,AR3
        ADDI    4,AR2
        ADDI    6,AR3
        LDA     8,IR0
        LDI     @FFT_SIZE,RC
        LSH     -3,RC
        SUBI    2,RC
        RPTBD   LOOP3_A
        SUBF3   *AR2,*AR1,R1
        ADDF3   *AR2,*AR1,R2
        NEGF    *AR3,R3
        LDF     *+AR2(IR0),R0           ; R0 = X(I3)


Example 6—17. Real Forward Radix-2 FFT (Continued) 


STE R2, *AR1++(IRO) 
SUBF3 RO, *AR1,R1 7; RL = X&(T1) - X(73) ---=-- + 
STF R1, *AR2++ (IRO) 7 | 
ADDF3 RO, *AR1,R2 2 Re = “XCEL ACES) = | 
STF R3, *AR3++ (IRO) 7 | 
LOOP3_A: NEGF *AR3,R3 PRS = =—xX(24). -—+ | 
| 
, 
STF R2, *AR1 3 SEL) <= + | 
STF R1, *AR2 - KTS) < + 
STF R3, *AR3 S 3014) meee eed + 
;
; Part B:
;
;              |______| 0
;       AR0 -> |__I1__| 1 <- X(I1) + [X(I3)*COS + X(I4)*COS]
;              |______| 2
;       AR1 -> |__I2__| 3 <- X(I1) - [X(I3)*COS + X(I4)*COS]
;              |______| 4
;       AR2 -> |__I3__| 5 <- -X(I2) - [X(I3)*COS - X(I4)*COS]
;              |______| 6
;       AR3 -> |__I4__| 7 <- X(I2) - [X(I3)*COS - X(I4)*COS]
;              |______| 8
;       AR0 -> |______| 9       NOTE: COS(2*pi/8) = SIN(2*pi/8)
;                  |
;                 \|/
;
        LDI     @FFT_SIZE,RC
        LSH     -3,RC
        LDA     RC,IR1
        SUBI    3,RC
        LDA     8,IR0
        LDA     @DEST_ADDR,AR0
        LDA     AR0,AR1
        LDA     AR0,AR2
        LDA     AR0,AR3
        ADDI    1,AR0
        ADDI    3,AR1
        ADDI    5,AR2
        ADDI    7,AR3
        LDA     @SINE_TABLE,AR7         ; Initialize table pointers.
        LDF     *++AR7(IR1),R7          ; R7 = COS(2*pi/8)
                                        ; *AR7 = COS(2*pi/8)
        MPYF3   *AR7,*AR2,R0            ; R0 = X(I3)*COS
        MPYF3   *AR3,R7,R1              ; R1 = X(I4)*COS
        ADDF3   R0,R1,R2                ; R2 = [X(I3)*COS + X(I4)*COS]
        MPYF3   *AR7,*+AR2(IR0),R0
||      SUBF3   R0,R1,R3                ; R3 = -[X(I3)*COS - X(I4)*COS]
        SUBF3   *AR1,R3,R4              ; R4 = -X(I2) + R3
;
        RPTBD   LOOP3_B
        ADDF3   *AR1,R3,R4              ; R4 = X(I2) + R3
||      STF     R4,*AR2++(IR0)          ; X(I3) <- -X(I2) + R3
        SUBF3   R2,*AR0,R4              ; R4 = X(I1) - R2
||      STF     R4,*AR3++(IR0)          ; X(I4) <- X(I2) + R3
        ADDF3   *AR0,R2,R4              ; R4 = X(I1) + R2
||      STF     R4,*AR1++(IR0)          ; X(I2) <- X(I1) - R2
        MPYF3   *AR3,R7,R1
||      STF     R4,*AR0++(IR0)          ; X(I1) <- X(I1) + R2
        ADDF3   R0,R1,R2
        MPYF3   *AR7,*+AR2(IR0),R0
||      SUBF3   R0,R1,R3
        SUBF3   *AR1,R3,R4
        ADDF3   *AR1,R3,R4
||      STF     R4,*AR2++(IR0)
        SUBF3   R2,*AR0,R4
||      STF     R4,*AR3++(IR0)
LOOP3_B: ADDF3  *AR0,R2,R4
||      STF     R4,*AR1++(IR0)
        MPYF3   *AR3,R7,R1
||      STF     R4,*AR0++(IR0)
        ADDF3   R0,R1,R2
        SUBF3   R0,R1,R3
        SUBF3   *AR1,R3,R4
        ADDF3   *AR1,R3,R4
||      STF     R4,*AR2
        SUBF3   R2,*AR0,R4
||      STF     R4,*AR3
        ADDF3   *AR0,R2,R4
||      STF     R4,*AR1
        STF     R4,*AR0


;
; Perform Fourth FFT Loop.
;
; Part A:
;
;       AR1 -> |__I1__|  0 <- X(I1) + X(I3)
;                 .
;       AR2 -> |__I3__|  8 <- X(I1) - X(I3)
;                 .
;       AR3 -> |__I4__| 12 <- -X(I4)
;                 .
;       AR1 -> |______| 16
;                  |
;                 \|/
;
        LDA     @DEST_ADDR,AR1
        LDA     AR1,AR2
        LDA     AR1,AR3
        ADDI    8,AR2
        ADDI    12,AR3
        LDA     16,IR0
        LDI     @FFT_SIZE,RC
        LSH     -4,RC
        SUBI    2,RC
        RPTBD   LOOP4_A
        SUBF3   *AR2,*AR1,R1
        ADDF3   *AR2,*AR1,R2
        NEGF    *AR3,R3
        LDF     *+AR2(IR0),R0           ; R0 = X(I3)
||      STF     R2,*AR1++(IR0)
        SUBF3   R0,*AR1,R1              ; R1 = X(I1) - X(I3)
||      STF     R1,*AR2++(IR0)
        ADDF3   R0,*AR1,R2              ; R2 = X(I1) + X(I3)
||      STF     R3,*AR3++(IR0)
LOOP4_A: NEGF   *AR3,R3                 ; R3 = -X(I4)
        STF     R2,*AR1                 ; X(I1) <- R2
        STF     R1,*AR2                 ; X(I3) <- R1
        STF     R3,*AR3                 ; X(I4) <- R3
I1_ (3rd) 
T1_ (2nd) 
I1_(1st) 
I2_ (1st) 
__12_ (2nd) 
I2_ (3rd) 
I3_ (3rd) 
I3_ (2nd) 
T3_(1st) 
T4_ (1st) 
__ 14 (2nd) 
T4_ (3rd) 
| 16 
| 
\I/ 


;RO = X(I3) 
+R 
fR2 
;R3 
7X 
7X 
7X 

0 

Te “ee CLT) 
2 . 

3 

4 

5 

6 : 

t <= SCY) 
8 

O° k=. =KtT2) 
10 

11 

12 

13 

14 é 

15: <-. X(12) 
17 


+ 


X(I1) - X(I3) -----1 + 
X(I1) + X(I13) --+ 
-X(I4) --+ 

Z : 

< 

ear 

[X(13) *COS + X(I4) *SIN] 
[X(I3) *COS + X(I4) *SIN] 
[X (13) *SIN - X(I4) *COS] 
[X(13) *SIN - X(I4) *COS] 
DA 


DA 
DF 


@FFT_SIZI 
-4,RC 
RC, IRL 
2, IR0 

S37 RC 
@DEST_ADDR, ARO 
ARO, AR1 

ARO, AR2 

ARO, AR3 

ARO, AR4 

1, ARO 

7,AR1 


Ut 


@SINE_TABLE, ART 
*++4AR7(IR1),R7 


*++AR6 (IR1),R6 


*++4AR5 (IR1),R5 


*AR7, *AR4, RO 
*++AR2 (IRO),R5,R4 
*--AR3(IRO),R5,R1 
*AR7, *AR3, RO 


*AR6, *-AR4, RO 
R4,R0,R3 
*—--AR1(IRO),R3,R4 
*AR1,R3,R4 

R4, *AR2-- 
R2,*++ARO (IRO),R4 
R4, *AR3 
*ARO,R2,R4 

R4, *ARL 


*++AR3,R6,R1 
R4, *ARO 
RO,R1,R2 
*ARS, *-AR4 (IRO),RO 
RO,R1,R3 
*++AR1,R3,R4 
*AR1,R3,R4 
R4, *AR2 
R2,*--ARO,R4 
R4, *AR3 
*ARO,R2,R4 
R4, *ARL 
*--AR2,R7,R4 
R4, *ARO 
*++AR3,R7,R1 
*AR5, *AR3, RO 


= SIN(1*[2*pi/16]) 


;*AR7T = COS (3*[2*pi/16]) 


= SIN(2*[2*pi/16]) 


;*AR6 = COS (2*[2*pi/16]) 


= SIN(3*[2*pi/16]) 


7;*ARS = COS (1*[2*pi/16]) 


= X(I3) *COS (3) 

= X(I3)*SIN(3) 

= X(14) *SIN(3) 

= X(I4) *COS (3) 
[X(I3)*COS + X(I4) *SIN] 

= —[X(13)*SIN - X(1I4)*COS] 
X (12) R3 + 

= X(I2) + R3 --|--+ 

3) <------------4 7 

= Se (D); (SRO Se 

14) < # 

= (ED) Re. aa eat 

LD). Betas seen ass 7 

Tl) < + 
        ADDF3   R0,R1,R2
        MPYF3   *AR7,*++AR4(IR1),R0
        SUBF3   R4,R0,R3
        SUBF3   *++AR1,R3,R4
        RPTBD   LOOP4_B
        ADDF3   *AR1,R3,R4
        STF     R4,*AR2++(IR1)
        SUBF3   R2,*--AR0,R4
        STF     R4,*AR3++(IR1)
        ADDF3   *AR0,R2,R4
        STF     R4,*AR1++(IR1)
        MPYF3   *++AR2(IR0),R5,R4
        STF     R4,*AR0++(IR1)
        MPYF3   *--AR3(IR0),R5,R1
        MPYF3   *AR7,*AR3,R0
        ADDF3   R0,R1,R2
        MPYF3   *AR6,*-AR4,R0
        SUBF3   R4,R0,R3
        SUBF3   *--AR1(IR0),R3,R4
        ADDF3   *AR1,R3,R4
        STF     R4,*AR2--
        SUBF3   R2,*++AR0(IR0),R4
        STF     R4,*AR3
        ADDF3   *AR0,R2,R4
        STF     R4,*AR1
        MPYF3   *++AR3,R6,R1
        STF     R4,*AR0
        ADDF3   R0,R1,R2
        MPYF3   *AR5,*-AR4(IR0),R0
        SUBF3   R0,R1,R3
        SUBF3   *++AR1,R3,R4
        ADDF3   *AR1,R3,R4
        STF     R4,*AR2
        SUBF3   R2,*--AR0,R4
        STF     R4,*AR3
        ADDF3   *AR0,R2,R4
        STF     R4,*AR1
        MPYF3   *--AR2,R7,R4
        STF     R4,*AR0
        MPYF3   *++AR3,R7,R1
        MPYF3   *AR5,*AR3,R0
        ADDF3   R0,R1,R2
        MPYF3   *AR7,*++AR4(IR1),R0
        SUBF3   R4,R0,R3
        SUBF3   *++AR1,R3,R4
        ADDF3   *AR1,R3,R4
        STF     R4,*AR2++(IR1)
        SUBF3   R2,*--AR0,R4
        STF     R4,*AR3++(IR1)
LOOP4_B: ADDF3  *AR0,R2,R4
        STF     R4,*AR1++(IR1)
        MPYF3   *++AR2(IR0),R5,R4
        STF     R4,*AR0++(IR1)
        MPYF3   *--AR3(IR0),R5,R1
        MPYF3   *AR7,*AR3,R0
        ADDF3   R0,R1,R2
        MPYF3   *AR6,*-AR4,R0
        SUBF3   R4,R0,R3
        SUBF3   *--AR1(IR0),R3,R4
        ADDF3   *AR1,R3,R4
        STF     R4,*AR2--
        SUBF3   R2,*++AR0(IR0),R4
        STF     R4,*AR3
        ADDF3   *AR0,R2,R4
        STF     R4,*AR1
        MPYF3   *++AR3,R6,R1
        STF     R4,*AR0
        ADDF3   R0,R1,R2
        MPYF3   *AR5,*-AR4(IR0),R0
        SUBF3   R0,R1,R3
        SUBF3   *++AR1,R3,R4
        ADDF3   *AR1,R3,R4
        STF     R4,*AR2
        SUBF3   R2,*--AR0,R4
        STF     R4,*AR3
        ADDF3   *AR0,R2,R4
        STF     R4,*AR1
        MPYF3   *--AR2,R7,R4
        STF     R4,*AR0
        MPYF3   *++AR3,R7,R1
        MPYF3   *AR5,*AR3,R0
        ADDF3   R0,R1,R2
        SUBF3   R4,R0,R3
        SUBF3   *++AR1,R3,R4
        ADDF3   *AR1,R3,R4
        STF     R4,*AR2
        SUBF3   R2,*--AR0,R4
        STF     R4,*AR3
        ADDF3   *AR0,R2,R4
        STF     R4,*AR1
        STF     R4,*AR0
;
; Perform Remaining FFT loops (loop 4 onwards).
;
;                                     LOOP
;                                   1st  2nd
;                                    \/   \/
;                |___X'(I1)_____|     0    0 <- X'(I1) + X'(I3)
;       AR1 ->   |_X(I1)_(1st)__|     1    1 <- X(I1) + [X(I3)*COS + X(I4)*SIN]
;                |_X(I1)_(2nd)__|     2    2
;                |_X(I1)_(3rd)__|     3    3
;                       .
;         A ->
;                |___X'(I2)_____|     8   16
;         B ->
;                       .
;                |_X(I2)_(3rd)__|    13   29
;                |_X(I2)_(2nd)__|    14   30
;       AR2 ->   |_X(I2)_(1st)__|    15   31 <- X(I1) - [X(I3)*COS + X(I4)*SIN]
;                |___X'(I3)_____|    16   32 <- X'(I1) - X'(I3)
;       AR3 ->   |_X(I3)_(1st)__|    17   33 <- -X(I2) - [X(I3)*SIN - X(I4)*COS]
;                |_X(I3)_(2nd)__|    18   34
;                |_X(I3)_(3rd)__|    19   35
;                       .
;         C ->
;                |___X'(I4)_____|    24   48 <- -X'(I4)
;         D ->
;                       .
;                |_X(I4)_(3rd)__|    29   61
;                |_X(I4)_(2nd)__|    30   62
;       AR4 ->   |_X(I4)_(1st)__|    31   63 <- X(I2) - [X(I3)*SIN - X(I4)*COS]
;                |______________|    32   64
;       AR1 ->   |______________|    33   65
;                       |
;                      \|/
;
        LDA     @FFT_SIZE,IR0
        LSH     -2,IR0
        STI     IR0,@SEPARATION
        LSH     -2,IR0
        LDI     5,R5
        LDI     3,R7
        LDI     16,R6
        LDA     @DEST_ADDR,AR5
        LDA     @DEST_ADDR,AR1
        LSH     -1,IR0
        LSH     1,R7
LOOP:   ADDI    1,R7
        LSH     1,R6
        LDA     AR1,AR4
        ADDI    R7,AR1                  ; AR1 points at A.
        LDA     AR1,AR2
        ADDI    2,AR2                   ; AR2 points at B.
        ADDI    R6,AR4
        SUBI    R7,AR4                  ; AR4 points at D.
        LDA     AR4,AR3
        SUBI    2,AR3                   ; AR3 points at C.
        LDA     @SINE_TABLE,AR0         ; AR0 points at SIN/COS table.
        LDA     R7,IR1
        LDI     R7,RC
INLOP:  ADDF3   *--AR1(IR1),*++AR2(IR1),R0 ; R0 = X'(I1) + X'(I3)
        SUBF3   *--AR3(IR1),*AR1++,R1   ; R1 = X'(I1) - X'(I3)
        NEGF    *--AR4,R2               ; R2 = -X'(I4)
||      STF     R0,*-AR1                ; X'(I1) <- R0
        STF     R1,*AR2--               ; X'(I3) <- R1
||      STF     R2,*AR4++(IR1)          ; X'(I4) <- R2
        LDA     @SEPARATION,IR1         ; IR1 = SEPARATION BETWEEN SIN/COS TABLES
        SUBI    3,RC
        MPYF3   *++AR0(IR0),*AR4,R4     ; R4 = X(I4)*SIN
        MPYF3   *AR0,*++AR3,R1          ; R1 = X(I3)*SIN
        MPYF3   *++AR0(IR1),*AR4,R0     ; R0 = X(I4)*COS
        MPYF3   *AR0,*AR3,R0            ; R0 = X(I3)*COS
||      SUBF3   R1,R0,R3                ; R3 = -[X(I3)*SIN - X(I4)*COS]
        MPYF3   *++AR0(IR0),*-AR4,R0
||      ADDF3   R0,R4,R2                ; R2 = X(I3)*COS + X(I4)*SIN
        SUBF3   *AR2,R3,R4              ; R4 = R3 - X(I2)
;
        RPTBD   IN_BLK
        ADDF3   *AR2,R3,R4              ; R4 = R3 + X(I2)
||      STF     R4,*AR3++               ; X(I3) <- R3 - X(I2)
        SUBF3   R2,*AR1,R4              ; R4 = X(I1) - R2
||      STF     R4,*AR4--               ; X(I4) <- R3 + X(I2)
        ADDF3   *AR1,R2,R4              ; R4 = X(I1) + R2
||      STF     R4,*AR2--               ; X(I2) <- X(I1) - R2
        LDF     *-AR0(IR1),R3
        MPYF3   *AR4,R3,R4
||      STF     R4,*AR1++               ; X(I1) <- X(I1) + R2
        MPYF3   *AR3,R3,R1
        MPYF3   *AR0,*AR3,R0
||      SUBF3   R1,R0,R3
        MPYF3   *++AR0(IR0),*-AR4,R0
||      ADDF3   R0,R4,R2
        SUBF3   *AR2,R3,R4
        ADDF3   *AR2,R3,R4
||      STF     R4,*AR3++
        SUBF3   R2,*AR1,R4
||      STF     R4,*AR4--
IN_BLK: ADDF3   *AR1,R2,R4
||      STF     R4,*AR2--
        LDF     *-AR0(IR1),R3
        MPYF3   *AR4,R3,R4
||      STF     R4,*AR1++
        MPYF3   *AR3,R3,R1
        MPYF3   *AR0,*AR3,R0
||      SUBF3   R1,R0,R3
        LDA     R6,IR1
        ADDF3   R0,R4,R2
        SUBF3   *AR2,R3,R4
        ADDF3   *AR2,R3,R4
||      STF     R4,*AR3++(IR1)
        SUBF3   R2,*AR1,R4
||      STF     R4,*AR4++(IR1)
        ADDF3   *AR1,R2,R4
||      STF     R4,*AR2++(IR1)
        STF     R4,*AR1++(IR1)
        SUBI3   AR5,AR1,R0
        CMPI    @FFT_SIZE,R0
        BLTD    INLOP                   ; LOOP BACK TO THE INNER LOOP
        LDA     @SINE_TABLE,AR0         ; AR0 POINTS TO SIN/COS TABLE
        LDA     R7,IR1
        LDI     R7,RC
        ADDI    1,R5
        CMPI    @LOG_SIZE,R5
        BLED    LOOP
        LDA     @DEST_ADDR,AR1
        LSH     -1,IR0
        LSH     1,R7
;
; Return to C environment.
;
ENDB:   POP     DP                      ; Restore C environment variables.
        POP     AR7
        POP     AR6
        POP     AR5
        POP     AR4
        POPF    R7
        POP     R7
        POPF    R6
        POP     R6
        POP     R5
        POP     R4
        POP     FP
        RETS
        .end
*
* No more.
*
*******************************************************************************


Example 6-18. Real Inverse Radix-2 FFT

*******************************************************************************
*
* FILENAME    : IFFT_RL.ASM
* DESCRIPTION : INVERSE FFT FOR TMS320C40
* DATE        : 1/19/93
* VERSION     : 2.0
*
*******************************************************************************
*
* VERSION   DATE      COMMENTS
* 1.0       2/18/92   DANIEL MAZZOCCO (TI Houston):
*                     Original Release (C30 version).
*                     Started from forward real FFT routine written by Alex
*                     Tessarolo, rev 2.0.
* 2.0       1/19/93   ROSEMARIE PIEDRA (TI Houston):
*                     C40 porting started from C30 inverse real FFT version 1.0
*                     (C30). Expanded calling conventions to registers for
*                     parameter passing.
*
*******************************************************************************
*
* SYNOPSIS:
*
*   int ifft_rl(FFT_SIZE, LOG_SIZE, SOURCE_ADDR, DEST_ADDR, SINE_TABLE, BIT_REVERSE)
*               ar2       r2        r3           rc         rs          re
*
*   int    FFT_SIZE      ; 64, 128, 256, 512, 1024, ...
*   int    LOG_SIZE      ; 6, 7, 8, 9, 10, ...
*   float  *SOURCE_ADDR  ; Points to where data is originated
*                        ; and operated on.
*   float  *DEST_ADDR    ; Points to where data will be stored.
*   float  *SINE_TABLE   ; Points to the SIN/COS table.
*   int    BIT_REVERSE   ; = 0, Bit Reversing is disabled.
*                        ; <> 0, Bit Reversing is enabled.
*
* NOTE: 1) If SOURCE_ADDR = DEST_ADDR, then in place bit reversing is
*          performed, if enabled (more processor intensive).
*       2) FFT_SIZE must be >= 64 (this is not checked).
*
* DESCRIPTION:
*
*   Generic function to do an inverse radix-2 FFT computation on the C40.
*   The input data array is FFT_SIZE-long with real and imaginary points R and
*   I as follows:
*
*        SOURCE_ADDR[0]
*          .
*          .                           -> I(2)
*        SOURCE_ADDR[FFT_SIZE - 1]     -> I(1)
*
*   The output data array will contain only real values. Bit reversal is
*   optionally implemented at the end of the function.
*
*   The sine/cosine table for the twiddle factors is expected to be supplied in
*   the following format:
*
*        SINE_TABLE[0]                 -> sin(0*2*pi/FFT_SIZE)
*        SINE_TABLE[1]                 -> sin(1*2*pi/FFT_SIZE)
*          .
*          .
*        SINE_TABLE[FFT_SIZE/2-2]      -> sin((FFT_SIZE/2-2)*2*pi/FFT_SIZE)
*        SINE_TABLE[FFT_SIZE/2-1]      -> sin((FFT_SIZE/2-1)*2*pi/FFT_SIZE)
*
*   NOTE: The table is the first half period of a sine wave.
*
*******************************************************************************
*
* REGISTERS USED: R0, R1, R2, R3, R4, R5, R6, R7
*                 AR0, AR1, AR2, AR3, AR4, AR5, AR6, AR7
*                 IR0, IR1
*                 RC, RS, RE
*                 DP
*
* MEMORY REQUIREMENTS: Program = 322 Words (approximately)
*                      Data    = 7 Words
*
*******************************************************************************


*
* NOTE: 1. Calling C program can be compiled using either large or small model.
*          Both calling conventions (stack or register parameter passing)
*          are supported.
*       2. Sections needed in linker command file: .ffttxt : code
*                                                  .fftdat : data
*       3. The SOURCE_ADDR must be aligned such that the first LOG_SIZE bits
*          are zero (this is not checked by the program).
*
* CAUTION: DP initialized only once in the program. Be wary with interrupt
*          service routines. Ensure interrupt service routines save DP pointer.
*
*******************************************************************************
*
* MEMORY REQUIREMENTS (continued): Stack = 12 Words
*
* BENCHMARKS:
*
*   Assumptions: - Program in RAM0
*                - Reserved data in RAM0
*                - Stack on Local/Global Bus RAM
*                - Sine/Cosine tables in RAM0
*                - Processing and data destination in RAM1.
*                - Local/Global Bus RAM, 0 wait state.
*
*   FFT Size     Bit Reversing     Data Source     Cycles (C30)
*   1024         OFF               RAM1            25120 approx.
*
*   NOTE: - This number does not include the C callable overheads.
*         - This benchmark is the number of cycles between labels STARTB
*           and ENDB.
*         - If .ffttxt is located in external SRAM, enable cache for faster
*           performance.
*
*******************************************************************************
Set 
-global 
-global 
sUsecr 
-usect 
sEBSCT 
-USece 


INE_TABL 


EPARATIO 


S 
BIT_REVERSE: 
S . 


sUsect 
AUsect 
~USece 


AR3 

ifft_rl 
STARTB, ENDB 
"  CEttdat’,-1 
" ifftdat”,1 
(spk Etat 
" st tttdat’ 1 
" ifftdat”,1 
oPee tat; dL 
".¢Ettdat’ 1 


; Initialize C Function. 


’ 


a PETE lg 


tye 


uwotu tuto Uo UU UU TOE 


c 


C 
NNNNNNNNNNNHN 
I 


with. RI 


DP FFT_SIZE 


wii 


DP 


EGPARM 


ffttxt” 


KKK KKK KKK KKK KK KKK KKK KKK KKK KK KKK KKK KKK KKK KKK KKK KKK KKK KKK KKK KKK KKK KKK KKK KKK KKK 


;Entry execution point. 


;Reserve memory for arguments. 


;Preserve C environment. 


;Initialize DP pointer. 
yarguments passed in stack 
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Perform Last FFT loops first 


AR1-> 


AR2-> 


AR3-> 


X(I1)-X (12) ] *COS-[X (13) 


C => 
Dress 
AR4-> 


X(I1)-X (12) ] *SIN+[X (13) 


AR1-> 


ADDR 
@SINE_TABI 


@BIT_REVE 


RI TZ) 


X(12)_ (3rd) 


X(12)_ (2nd) 


X(12)_ (1st) 


XE (Tie 


X(1I (1st) 


(2nd) 


3) 
+X (14) ]*SI 
X (13) 
X (13)_ (3rd) 


—__ Xx" (14)_ 


\L/ 


™~ 


WNHOW 


24 


16 


48 


(loop 2 onwards). 


LOOP 
lst 2nd 
ear SCE). Sake Cia) 
<- + X€T2) 


Se RE CE) A 


2m ea: Soe 
Se RE (TLS ce (ES) 
= 


<- -X’ (14) * 2 



STARTB: 


LOOP: 


INLOP: 


Eb EAS 


POW Pe be eee 


ADDR, AR5 
ADDR, AR1 


*--AR1(IR1),*--AR3(IR1),RO 
*AR3,*AR1,R1 

*--AR4,R2 
RO, *ARL++ 
-2.0,R2 
*— -AR2,R3 
R1, *AR3++ 
2.0,R3 
R3, *AR2++(IR1) 
R2, *AR4++(IR1) 
@FFT_SIZE,IR1 
@SINE_TABLE, ARO 
-2,IR1 
Sy RC 
*AR2,*AR1,R3 
*AR1L, *AR2,R2 
R3,*++ARO(IRO),R1 
*AR4,RA 

R3, *++ARO(IR1),RO 
*AR3,R4,R3 

R4, *AR3, R2 
R2,*ARL++ 

R2, *ARO--(IR1),R4 
R3, *AR2-- 

IN_BLK 
R4,R1,R3 
R2,*ARO,R1 

R3, *AR4-- 
R1,RO,R4 
*AR2,*AR1,R3 
*AR1L, *AR2,R2 
R3,*++ARO(IRO),R1 
R4, *AR3++ 

*AR4,RA 

R3,*++ARO (IR1),RO 


;step between two consecutive sines 


;stage number from 4 to M. 


;R7 is FFT_SIZE/4-1 


ipts) 


;and will be used to point at A & D. 
;R6 will be used to point at D. 


;R6 is FFT_SIZE at the lst loop 


;AR1 points at A. 


;AR2 points at B. 


;AR4 points at D. 


;AR3 points at C. 


Ne Ne Ne Ne Ne Ne Ne Ne Ne Ne eS 


Ne Ne Ne Ne Ne Ne we Ne Ne Ne 


Me Ne Ne Ne Ne Ne Ne Ne Ne Ne 


(ie 15 for 64 


RO = X’ (11) + X’ (13) ---41 
Rl = X’ (I1) - X’ (13) -+ 

| 
X’ (I1) < |-4 
R2 = -2*X’ (14) --+ | 

| 
X’ (13) < + 
R3 = 2*X’ (12) -, 
X’ (12) <------- e 
X’ (14) <---------4 + 


IR1=SEPARATION BETWEEN SIN/COS TBLS 
ARO points at SIN/COS table 


R3 = X(I1)-X(1I2) 
R2 = X(I1)+X(I2) 
Rl = R3*SIN 

R4 = X(I4) 

RO = R3*COS 

R3 = X(I4)-X(I3) --|--+ 
R2 = X(13)+X(I4) 

X(I1) < 


R3 = X(I1)-X(1I2) 


R2 = X(1I1)+X(1I2) ---+ 
Rl = R3*SIN | 
X (13) | 
R4 = X(14) | 
RO = R3*COS | 


R4 = R2*COS 

RATA) < + 
R3 = R3*SIN + R2*COS -—----4 
Rl = R2*SIN 

A(TA4A) =< 

R4 = R3*COS -—- R2*SIN 
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IN_BLK: 


Perform T 


Ne Ne Ne Ne 


PHNNnNNE PHEnN,PN 


SI 


SI 


UBF 3 
DDF3 


*AR3,R4,R3 

R4, *AR3,R2 
R2,*AR1++ 
R2,*ARO--(IR1),R4 
R3, *AR2-- 
R4,R1,R3 
R2,*ARO,R1 

R3, *AR4-- 
R1,RO,R4 
*AR2,*AR1,R3 
*AR1, *AR2,R2 
R3,*++ARO (IRO),R1 


R3,*++ARO (IR1) ,RO 
*AR3,R4,R3 


@FFT_SIZE, RO 
INLOP 

*AR2++ (IRL) 
R7, IR1 

R7,RC 

1, RS 
@LOG_SIZE,R5 
LOOP 
@SOURCE_ADDR, AR1 
1, IR0 

-1,R7 


hird FFT loop 


ee ee ee 


x ww 
AN UW 
= 


X(I 


next 


dou 


= X(14)-X(I3 = Sat 
= X(13)+xX(14 | 

1) <------------- + 

= R2*COS 

2) < + 
= R3*SIN + R2*COS —---- 
= R2*SIN 

4) < 

= R3*COS - R2*SIN 

= X(I1)-X(I2) 

= X(I1)+X(1I2) ---+ 

= R3*SIN | 

3) | 

= X(14) | 

= R3*COS | 

= X(14)-X(1I3) --|--+ 
= X(13)+X (14) | 

1) <------------- + 

= R2*COS 

2) < + 
prepared for the next 
= R3*SIN + R2*COS —---- 
= R2*SIN 

4) < 

= R3*COS —- R2*SIN 


Y 


stage if any left 


ble step in sine table 



P BACK TO THE INNER LOOP 



Ne Ne Ne Ne Ne Ne Ne Ne Ne Ne Ne Ne Ne Ne Ne Ne 


Part A: 


AR1-> 


AR2 


AR3-> 


AR4-> 


AR1-> 


LOOP3_A: STF 


Ne Ne Ne Ne Ne Ne Ne Ne Ne Ne Ne eS 


Part: B: 


—11__ 


H 
WW 
ODAIDOBWNHERO 


\L/ 


<- X(I1) + X(I3) 
SS sie CE) 
<- X(I1) - X(1I3) 


<- -2 * X(14) 


@SOURCE_ADDR, AR1 


AR1, AR2 
AR1,AR3 
AR1,AR4 
2, AR2 
4, AR3 


-3,RC 
1,RC 
LOOP3_A 
6,AR4 
8, IRO 


@FFT_SIZE,RC 


@SINE_TABLE, ARO ; ARO points at SIN/COS table 
*AR3,R3 
R3, *AR1, RO ; RO = X! (11) + X’ (13) ---+ 
R3,*AR1,R1 ; Rl = xX’ (21) - xX’ (13) -+ | 
*AR4, R2 ; 
RO, *AR1++ (IRO) 7X? (1h x + 
-2.0,R2 ; RQ = -2*x’ (14) --+ 
*AR2,R3 j | 
R1, *AR3++ (IRO) pet 3). | ; 
2.0,R3 ; R3 = 2*x" (12) -, | 
R3, *AR2++ (IRO) ; X! (12) <------- aan 
R2, *AR4++ (IRO) ; X' (14) <--------- + 
;
; Part B:
;
;              |______| 0
;       AR1 -> |__I1__| 1 <- X(I1) + X(I2)
;              |______| 2
;       AR2 -> |__I2__| 3 <- X(I4) - X(I3)
;              |______| 4
;       AR3 -> |__I3__| 5 <- [X(I1)-X(I2)]*COS - [X(I3)+X(I4)]*SIN
;              |______| 6
;       AR4 -> |__I4__| 7 <- [X(I1)-X(I2)]*SIN + [X(I3)+X(I4)]*COS
;              |______| 8
;       AR1 -> |______| 9       NOTE: COS(2*pi/8) = SIN(2*pi/8)
;                  |
;                 \|/
;
        LDA     @SOURCE_ADDR,AR1
        LDA     AR1,AR2
        LDA     AR1,AR3
        LDA     AR1,AR4
        ADDI    1,AR1
        ADDI    3,AR2
        ADDI    5,AR3
        ADDI    7,AR4
        LDA     @SINE_TABLE,AR7         ; AR7 points at SIN/COS table
        LDI     @FFT_SIZE,RC
        LSH     -3,RC
        LDA     RC,IR1
        SUBI    2,RC
        LDF     *AR2,R6                 ; R6 = X(I2)
        LDF     *AR3,R0                 ; R0 = X(I3)
        ADDF3   R6,*AR1,R5              ; R5 = X(I1)+X(I2)
        SUBF3   R6,*AR1,R4              ; R4 = X(I1)-X(I2)
        SUBF3   R0,R4,R3                ; R3 = X(I1)-X(I2)-X(I3)
        ADDF3   R0,R4,R2                ; R2 = X(I1)-X(I2)+X(I3)
        SUBF3   R0,*AR4,R1              ; R1 = X(I4)-X(I3)
||      STF     R5,*AR1++(IR0)          ; X(I1) <- R5
;
        RPTBD   LOOP3_B
        ADDF3   R2,*AR4,R5              ; R5 = X(I1)-X(I2)+X(I3)+X(I4)
||      STF     R1,*AR2++(IR0)          ; X(I2) <- R1
        MPYF3   R5,*++AR7(IR1),R1       ; R1 = R5*SIN
||      SUBF3   *AR4,R3,R2              ; R2 = X(I1)-X(I2)-X(I3)-X(I4)
        MPYF3   R2,*AR7,R0              ; R0 = R2*SIN
||      STF     R1,*AR4++(IR0)          ; X(I4) <- R1
;
        LDF     *AR2,R6                 ; R6 = X(I2)
        STF     R0,*AR3++(IR0)          ; X(I3) <- R0
        ADDF3   R6,*AR1,R5              ; R5 = X(I1)+X(I2)
        LDF     *AR3,R0                 ; R0 = X(I3)
        SUBF3   R6,*AR1,R4              ; R4 = X(I1)-X(I2)
        SUBF3   R0,R4,R3                ; R3 = X(I1)-X(I2)-X(I3)
        ADDF3   R0,R4,R2                ; R2 = X(I1)-X(I2)+X(I3)
        SUBF3   R0,*AR4,R1              ; R1 = X(I4)-X(I3)
||      STF     R5,*AR1++(IR0)          ; X(I1) <- R5
        ADDF3   R2,*AR4,R5              ; R5 = X(I1)-X(I2)+X(I3)+X(I4)
||      STF     R1,*AR2++(IR0)          ; X(I2) <- R1
        MPYF3   R5,*AR7,R1              ; R1 = R5*SIN
||      SUBF3   *AR4,R3,R2              ; R2 = X(I1)-X(I2)-X(I3)-X(I4)
LOOP3_B: MPYF3  R2,*AR7,R0              ; R0 = R2*SIN
||      STF     R1,*AR4++(IR0)          ; X(I4) <- R1
        STF     R0,*AR3                 ; X(I3) <- R0
;
; Perform first and second FFT loops.
;
;       AR1 -> |__I1__| 0 <- X(I1) + X(I3) + 2*X(I2)
;       AR2 -> |__I2__| 1 <- X(I1) + X(I3) - 2*X(I2)
;       AR3 -> |__I3__| 2 <- X(I1) - X(I3) - 2*X(I4)
;       AR4 -> |__I4__| 3 <- X(I1) - X(I3) + 2*X(I4)
;       AR1 -> |______| 4
;                  |
;                 \|/
;
        LDA     @SOURCE_ADDR,AR1
        LDA     AR1,AR2
        LDA     AR1,AR3
        LDA     AR1,AR4
        ADDI    1,AR2
        ADDI    2,AR3
        ADDI    3,AR4
        LDA     4,IR0
        LDI     @FFT_SIZE,RC
        LSH     -2,RC
        SUBI    2,RC
        LDF     *AR4,R6                 ; R6 = X(I4)
        LDF     *AR2,R7                 ; R7 = X(I2)
        LDF     *AR1,R1                 ; R1 = X(I1)
        MPYF    2.0,R6                  ; R6 = 2 * X(I4)
        MPYF    2.0,R7                  ; R7 = 2 * X(I2)
        SUBF3   R6,*AR3,R5              ; R5 = X(I3) - 2*X(I4)
        SUBF3   R5,R1,R4                ; R4 = X(I1)-X(I3)+2X(I4)
        SUBF3   R7,*AR3,R5              ; R5 = X(I3) - 2*X(I2)
||      STF     R4,*AR4++(IR0)          ; X(I4) <- R4
        ADDF3   R5,R1,R3                ; R3 = X(I1)+X(I3)-2X(I2)
        ADDF3   R6,*AR3,R4              ; R4 = X(I3) + 2*X(I4)
||      STF     R3,*AR2++(IR0)          ; X(I2) <- R3
        RPTBD   LOOP1_2
        SUBF3   R4,R1,R4                ; R4 = X(I1)-X(I3)-2X(I4)
        ADDF3   R7,*AR3,R0              ; R0 = X(I3) + 2*X(I2)
||      STF     R4,*AR3++(IR0)          ; X(I3) <- R4
        ADDF3   R0,R1,R0                ; R0 = X(I1)+X(I3)+2X(I2)
        LDF     *AR4,R6                 ; R6 = X(I4)
||      STF     R0,*AR1++(IR0)          ; X(I1) <- R0
        MPYF    2.0,R6                  ; R6 = 2 * X(I4)
        LDF     *AR2,R7                 ; R7 = X(I2)
        LDF     *AR1,R1                 ; R1 = X(I1)
        MPYF    2.0,R7                  ; R7 = 2 * X(I2)
        SUBF3   R6,*AR3,R5              ; R5 = X(I3) - 2*X(I4)
        SUBF3   R5,R1,R4                ; R4 = X(I1)-X(I3)+2X(I4)
        SUBF3   R7,*AR3,R5              ; R5 = X(I3) - 2*X(I2)
||      STF     R4,*AR4++(IR0)          ; X(I4) <- R4
        ADDF3   R5,R1,R3                ; R3 = X(I1)+X(I3)-2X(I2)
        ADDF3   R6,*AR3,R4              ; R4 = X(I3) + 2*X(I4)
||      STF     R3,*AR2++(IR0)          ; X(I2) <- R3
        SUBF3   R4,R1,R4                ; R4 = X(I1)-X(I3)-2X(I4)
        ADDF3   R7,*AR3,R0              ; R0 = X(I3) + 2*X(I2)
||      STF     R4,*AR3++(IR0)          ; X(I3) <- R4
LOOP1_2: ADDF3  R0,R1,R0                ; R0 = X(I1)+X(I3)+2X(I2)
        STF     R0,*AR1                 ; LAST X(I1) <- R0
;
; Check Bit Reversing Mode (on or off)
;
;   BIT_REVERSING = 0, then OFF (no bit reversing)
;   BIT_REVERSING <> 0, Then ON
;
ENDB:   LDI     @BIT_REVERSE,R0
        BZ      MOVE_DATA
;
; Check Bit Reversing Type.
;
;   If SourceAddr = DestAddr, Then In Place Bit Reversing
;   If SourceAddr <> DestAddr, Then Standard Bit Reversing
;
        LDI     @SOURCE_ADDR,R0
        CMPI    @DEST_ADDR,R0
        BEQ     IN_PLACE
;
; Bit reversing Type 1 (From Source to Destination).
;
;   NOTE: abs(SOURCE_ADDR - DEST_ADDR) must be > FFT_SIZE;
;         this is not checked.
;
        LDI     @FFT_SIZE,R0
        SUBI    2,R0
        LDA     @FFT_SIZE,IR0
        LSH     -1,IR0                  ; IR0 = Half FFT size.
        LDA     @SOURCE_ADDR,AR0
        LDA     @DEST_ADDR,AR1
        LDF     *AR0++,R1
        RPTS    R0
        LDF     *AR0++,R1
||      STF     R1,*AR1++(IR0)B
        STF     R1,*AR1++(IR0)B
        BR      DIVISION
;
; In Place Bit Reversing.
;
; Bit Reversing On Even Locations, 1st Half Only.
;
IN_PLACE: LDA   @FFT_SIZE,IR0
        LSH     -2,IR0                  ; IR0 = Quarter FFT size.
        LDA     2,IR1
        LDI     @FFT_SIZE,RC
        LSH     -2,RC
        SUBI    3,RC
        LDA     @DEST_ADDR,AR0
        LDA     AR0,AR1
        LDA     AR0,AR2
        NOP     *AR1++(IR0)B
        NOP     *AR2++(IR0)B
        LDF     *++AR0(IR1),R0
        LDF     *AR1,R1
        RPTBD   BITRV1
        CMPI    AR1,AR0                 ; Xchange Locations only if AR0<AR1.
        LDFGT   R0,R1
        LDFGT   *AR1++(IR0)B,R1
        LDF     *++AR0(IR1),R0
||      STF     R0,*AR0
        LDF     *AR1,R1
||      STF     R1,*AR2++(IR0)B
        CMPI    AR1,AR0
        LDFGT   R0,R1
BITRV1: LDFGT   *AR1++(IR0)B,R0
        STF     R0,*AR0
        STF     R1,*AR2
;
; Perform Bit Reversing Odd Locations, 2nd Half Only
;
        LDI     @FFT_SIZE,RC
        LSH     -1,RC
        LDA     @DEST_ADDR,AR0
        ADDI    RC,AR0
        ADDI    1,AR0
        LDA     AR0,AR1
        LDA     AR0,AR2
        LSH     -1,RC
        SUBI    3,RC
        NOP     *AR1++(IR0)B
        NOP     *AR2++(IR0)B
        LDF     *++AR0(IR1),R0
        LDF     *AR1,R1
        RPTBD   BITRV2
        CMPI    AR1,AR0                 ; Xchange Locations only if AR0<AR1.
        LDFGT   R0,R1
        LDFGT   *AR1++(IR0)B,R1
        LDF     *++AR0(IR1),R0
||      STF     R0,*AR0
        LDF     *AR1,R1
||      STF     R1,*AR2++(IR0)B
        CMPI    AR1,AR0
        LDFGT   R0,R1
BITRV2: LDFGT   *AR1++(IR0)B,R0
        STF     R0,*AR0                 ; STF R1,*AR2 later
;
; Perform Bit Reversing On Odd Locations, 1st Half Only.
;
        LDI     @FFT_SIZE,RC
        LSH     -1,RC
        LDA     RC,IR0
        LDA     @DEST_ADDR,AR0
        LDA     AR0,AR1
        ADDI    1,AR0
        ADDI    IR0,AR1
        LSH     -1,RC
        LDA     RC,IR0
        SUBI    2,RC
        RPTBD   BITRV3
        STF     R1,*AR2
        LDF     *AR0,R0
        LDF     *AR1,R1
        LDF     *++AR0(IR1),R0
||      STF     R0,*AR1++(IR0)B
BITRV3: LDF     *AR1,R1
||      STF     R1,*-AR0(IR1)
        STF     R0,*AR1
        STF     R1,*AR0
        BR      DIVISION
;
; Check Data Source Locations.
;
;   If SourceAddr = DestAddr, Then do nothing.
;   If SourceAddr <> DestAddr, Then move data.
;
MOVE_DATA: LDI  @SOURCE_ADDR,R0
        CMPI    @DEST_ADDR,R0
        BEQ     DIVISION
        LDI     @FFT_SIZE,R0
        SUBI    2,R0
        LDA     @SOURCE_ADDR,AR0
        LDA     @DEST_ADDR,AR1
        LDF     *AR0++,R1
        RPTS    R0
        LDF     *AR0++,R1
||      STF     R1,*AR1++
        STF     R1,*AR1
;
DIVISION: LDA   2,IR0
        LDI     @FFT_SIZE,R0
        FLOAT   R0                      ; exp = LOG_SIZE
        PUSHF   R0                      ; 32 MSB'S saved
        POP     R0
        NEGI    R0                      ; Neg exponent
        PUSH    R0
        POPF    R0                      ; R0 = 1/FFT_SIZE
        LDA     @DEST_ADDR,AR1
        LDI     @FFT_SIZE,RC
        LSH     -1,RC
        SUBI    2,RC
        RPTBD   LAST_LOOP
        LDA     @DEST_ADDR,AR2
        NOP     *AR2++
        MPYF3   R0,*AR1,R1              ; 1st location
        MPYF3   R0,*AR2,R2              ; 2nd, 4th, 6th, ... location
||      STF     R1,*AR1++(IR0)
LAST_LOOP: MPYF3 R0,*AR1,R1             ; 3rd, 5th, 7th, ... location
||      STF     R2,*AR2++(IR0)
        MPYF3   R0,*AR2,R2              ; last location
||      STF     R1,*AR1
        STF     R2,*AR2
;
; Return to C environment
;
        POP     DP                      ; Restore C environment variables.
        POP     AR7
        POP     AR6
        POP     AR5
        POP     AR4
        POPF    R7
        POP     R7
        POPF    R6
        POP     R6
        POP     R5
        POP     R4
        POP     FP
        RETS
        .end
*
* No more.
*
*******************************************************************************
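
The DIVISION block at the end of Example 6-18 scales every output point by 1/FFT_SIZE. Because FFT_SIZE is a power of two, the reciprocal is itself an exact power of two, which is why the listing can obtain it by negating the exponent of the floating-point value. The C sketch below shows the same scaling in portable form. The names `recip_pow2` and `scale_output` are illustrative assumptions and are not part of the original routine.

```c
#include <assert.h>
#include <stddef.h>

/* Return 1/n for a power-of-two n by repeated halving. Each halving only
   decrements the binary exponent, so the result is exact.
   Illustrative helper, not taken from the listing. */
static float recip_pow2(unsigned n)
{
    float r = 1.0f;
    assert(n != 0u && (n & (n - 1u)) == 0u);   /* n must be a power of two */
    while (n > 1u) {
        r *= 0.5f;
        n >>= 1;
    }
    return r;
}

/* Scale fft_size output points in place by 1/fft_size. */
static void scale_output(float *dest, unsigned fft_size)
{
    const float k = recip_pow2(fft_size);
    size_t i;
    for (i = 0; i < fft_size; ++i)
        dest[i] *= k;
}
```

Calling `scale_output(dest, 1024u)` multiplies each of the 1024 points by 2^-10, which corresponds to the multiply-by-R0 loop ending at LAST_LOOP.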
6.6 ’C4x Benchmarks

Table 6-1 provides benchmarks for common DSP operations. Table 6-2 summarizes
the FFT execution time required for FFT lengths between 64 and 1024 points for
the four algorithms in Example 6-12, Example 6-14, Example 6-17, Example 6-18,
and Example 6-15.

The benchmarks are given in cycles (the H1 internal processor cycle). To get
the benchmark time, multiply the number of cycles by the processor's internal
clock period. For example, for a 50 MHz ’C4x, multiply by 40 ns.
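
The conversion described above is a single multiplication. The following C sketch applies it to one entry from Table 6-2; the function name `cycles_to_ns` is an illustrative assumption, and the 40 ns period is the value the text gives for a 50 MHz ’C4x.

```c
#include <assert.h>

/* Convert a benchmark in H1 cycles to nanoseconds.
   h1_period_ns is the internal clock period in nanoseconds
   (40 for a 50 MHz 'C4x, per the text above). */
static unsigned long cycles_to_ns(unsigned long cycles,
                                  unsigned long h1_period_ns)
{
    assert(h1_period_ns > 0ul);
    return cycles * h1_period_ns;
}
```

For the 1024-point real inverse FFT (25120 cycles in Table 6-2), `cycles_to_ns(25120ul, 40ul)` gives 1004800 ns, or roughly 1.0 ms.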


Table 6-1. ’C4x Application Benchmarks

Application                                      Words      Cycles
Inverse of a float (32-bit mantissa accuracy)    7          7
Double-precision integer multiply                2
Square root (32-bit mantissa accuracy)           11
Vector dot product†                              6
Matrix times a vector                            10
FIR filter                                       6
IIR filter (one biquad)                          7
IIR filter (N>1 biquads)                         15         2+6N
LMS lattice filter                               11         1+5P
Inverse LPC lattice filter                                  3+3P
Mu-law (A-law) compression                       15 (16)    14 (16/10)
Mu-law (A-law) expansion                         11 (15)    11/10 (15/13)

† Based on a modification of the matrix times a vector benchmark


Table 6-2. FFT Timing Benchmarks (Cycles)

          Complex                                       Real
          Radix-2        Radix-4        Radix-2         Forward        Inverse
Points    Example 6-12   Example 6-14   Example 6-15    Example 6-17   Example 6-18
64        2290†          1745†          1425†           75?†           1012†
128       5179†          -              3336†           1683†          2269†
256       11588†         9216†          7655†           3814†          5086†
512       25677†         -              17302†          8633†          11343†
1024      56411‡         47237‡         38945‡          19404†         25120†

Assumptions:
† The data is in on-chip RAM1. Program (.ffttxt) and reserved data (.fftdat)
  are in on-chip RAM0. The sine/cosine table is in on-chip RAM0. Bit reversing
  is not considered. The cache is enabled.
‡ The data is in on-chip RAM. Program (.ffttxt) and reserved data (.fftdat)
  are in local (global) bus RAM with 0 wait states. Bit reversing is not
  considered. The sine/cosine table is on the global (local) bus. The cache is
  enabled.
Chapter 7

Programming the DMA Coprocessor

The ’C4x DMA (direct memory access) coprocessor is a ’C4x peripheral module.
With its six channels, the DMA maximizes sustained CPU performance by relieving
the CPU of burdensome I/O. Any of the six DMA channels can transfer data to and
from anywhere in the ’C4x's memory map for maximum flexibility.

Topic

7.1  Hints for DMA Programming
7.2  When a DMA Channel Finishes a Transfer
7.3  DMA Assembly Programming Examples
7.4  DMA C-Programming Examples
7.1 Hints for DMA Programming

The following hints will help you improve your DMA programming and avoid
unexpected results:

- Reset the DMA register before starting it. This clears any previously
  latched interrupt that may no longer exist. Also, set the DIE register
  (enabling interrupts for sync transfer) after starting the DMA channel.

- Take care in selecting the priority used to arbitrate between the CPU and
  DMA and also between DMA channels. If a DMA channel fails to finish a
  block transfer, it may have lower priority in a conflicting environment and
  not be granted access to the resource. CPU/DMA rotating priority is
  considered a safe first choice. Depending on CPU/DMA execution load,
  selection of other priority schemes could result in faster code. Fine tuning
  may be needed.

- Ensure that each interrupt is received when you use interrupt
  synchronization; otherwise, the DMA will never complete the block transfer.

- For faster execution, avoid memory/resource access conflicts between
  the CPU and DMA. Carefully allocate the different sections of the program
  in memory. Use the same care with DMA autoinitialization values in
  memory.

- Try to use read/write synchronization when reading from or writing to
  communication ports. This avoids a peripheral-bus halt during a read from an
  empty input FIFO or a write to a full output FIFO.

- Choose between DMA read and write synchronization when using a DMA
  channel to transfer from one communication port to another. The ’C4x
  does not allow synchronization of DMA channel reads/writes with
  ICRDYi/OCRDYj signals coming from two different communication ports (i ≠ j).

- When your application requires initializing the primary (or auxiliary) DMA
  channel while the auxiliary (or primary) channel may still be running, halt
  the running channel by writing a halt signal to the START or AUX START
  bits. Before proceeding, check the STATUS or AUX STATUS bits of the
  running channel to ensure it has halted. This is necessary because the
  DMA halt takes place on read/write boundaries (depending on the type of
  halt issued), and the channel must wait for any ongoing read or write
  cycles to complete. When reinitializing this channel, be especially careful
  to restore its previous status exactly. For an example of how to deal with
  this situation, refer to the Designer Notebook Page, split-mode DMA
  re-initialization, available through the DSP hotline.


7.2 When a DMA Channel Finishes a Transfer 


Many applications require that you perform certain tasks after a DMA channel 
has finished a block transfer. 


You can program the DMA to interrupt the CPU when this happens (TCC or 
AUX TCC bits). You can also detect the end of a transfer by polling for any 
of the following conditions: 


The corresponding IIF (DMA INTx) bit is set to 1 (interrupt polling). 
This requires that the DMA control register TCC (or AUX TCC) bit be set 
first. This method does not cause any extra CPU/DMA access conflict. But 
its drawback, when using split mode, is that you cannot differentiate 
whether the primary or auxiliary channel has finished. 


The transfer counter has a zero value. This option is sometimes not reli- 
able, because the DMA channel could be in the middle of an autoinitializa- 
tion sequence. 


The TCINT (or AUX TCINT) flag is set to 1. This option is reliable, but the 
CPU polls the flag over the peripheral bus, potentially causing a CPU/DMA 
access conflict if the DMA is operating to/from the peripheral bus. This is 
a good option if you do not foresee any problem with the additional access 
delay. 


The START (AUX START) bits in the DMA channel control register are 
set to 10 (binary). This option can also cause a CPU/DMA access conflict. 
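
The last polling option can be sketched in C as shown here. This is a minimal sketch rather than TI code: the helper names are hypothetical, the masks are the ones used by the wait macros in dma.h (Example 7-12), and the control-register value is passed in as a plain integer so that the logic can be tried on a host. On the ’C4x, the value would be read through a volatile pointer to the channel's control register.

```c
#include <stdint.h>

/* START field (bits 22-23) and AUX START field (bits 24-25) of the
   DMA channel control register; masks as used in dma.h. */
#define START_MASK      0x00c00000u
#define START_DONE      0x00800000u   /* START = 10 (binary)     */
#define AUX_START_MASK  0x03000000u
#define AUX_START_DONE  0x02000000u   /* AUX START = 10 (binary) */

/* Hypothetical helper: nonzero when the primary (or unified) channel
   reports START = 10, that is, the block transfer has finished. */
static int dma_prim_done(uint32_t ctrl)
{
    return (ctrl & START_MASK) == START_DONE;
}

/* Hypothetical helper: same test for the auxiliary channel. */
static int dma_aux_done(uint32_t ctrl)
{
    return (ctrl & AUX_START_MASK) == AUX_START_DONE;
}
```

Because each call implies a peripheral-bus read of the control register, a polling loop built on these helpers is subject to the CPU/DMA access conflict described above.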
7.3 DMA Assembly Programming Examples 


The DMA coprocessor is a memory-mapped peripheral that you can easily 
program from C as well as from assembly. Example 7-1 through Example 7-5 
show how to program the DMA coprocessor in assembly language. 
Example 7-6 through Example 7-11 show how to program it from C. The 
source code for Example 7-6 through Example 7-11 can be found on the 
TI BBS (self-extracting file: C4xdmaex.exe). 


Example 7-1 shows one way to set up DMA channel 2 to initialize an array 
to zero. This DMA transfer is set up to have priority over CPU operation and 
to generate an interrupt flag, DMA INT2, after the transfer is completed. The 
DMA control register is set to 00C4 0007h. 


Example 7-1. Array Initialization With DMA 


* 
*  TITLE ARRAY INITIALIZATION WITH DMA 
* 
*  THIS EXAMPLE INITIALIZES A 128-ELEMENT ARRAY TO ZERO. THE DMA 
*  TRANSFER IS SET UP TO HAVE HIGHER PRIORITY OVER CPU OPERATION. 
*  THE DMA INT2 INTERRUPT FLAG IS SET TO 1 AFTER THE TRANSFER IS 
*  COMPLETED. 
* 
         .data 
DMA2     .word   001000C0H      ;DMA channel 2 map address 
CONTROL  .word   00C40007H      ;DMA register initialization data 
SOURCE   .word   ZERO 
SRC_IDX  .word   0 
COUNT    .word   128 
DESTIN   .word   ARRAY 
DES_IDX  .word   1 
ZERO     .float  0.0            ;Array initialization value 0.0 
         .bss    ARRAY,128 
         .text 
START    LDP     @DMA2          ;Load data page pointer 
         LDA     @DMA2,AR0      ;Point to DMA channel 2 registers 
         LDI     @SOURCE,R0     ;Initialize DMA source register 
         STI     R0,*+AR0(1) 
         LDI     @SRC_IDX,R0    ;Initialize DMA source index register 
         STI     R0,*+AR0(2) 
         LDI     @COUNT,R0      ;Initialize DMA count register 
         STI     R0,*+AR0(3) 
         LDI     @DESTIN,R0     ;Initialize DMA destination register 
         STI     R0,*+AR0(4) 
         LDI     @DES_IDX,R0    ;Initialize DMA destination index register 
         STI     R0,*+AR0(5) 
         LDI     @CONTROL,R0    ;Start DMA channel 2 transfer 
         STI     R0,*AR0 
         .end 
The DMA transfer can be synchronized with external interrupts, 
communication-port ICRDY/OCRDY signals, and timer interrupts. To enable this 
feature, the SYNCH MODE field, bits 6-7, of the DMA control register must be 
configured to a proper value, and the corresponding bits of the DMA-interrupt 
enable (DIE) register must be set. Example 7-2 sets up DMA channel 4 read 
synchronization with the communication-port 4 ICRDY signal. The DMA 
continuously transfers data from the communication-port input register until 
the START field, bits 22-23 of the DMA control register, is changed by the CPU. 
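
The two fields named above can be pulled out of a control word with a shift and a mask. The following is a minimal sketch with hypothetical helper names. It relies only on the bit positions stated in the text (SYNCH MODE in bits 6-7, START in bits 22-23) and is meant for checking control-word constants on a host, not as a driver.

```c
#include <stdint.h>

/* SYNCH MODE field: bits 6-7 of the DMA control register. */
static unsigned dma_synch_mode(uint32_t ctrl)
{
    return (unsigned)((ctrl >> 6) & 0x3u);
}

/* START field: bits 22-23 of the DMA control register. */
static unsigned dma_start_field(uint32_t ctrl)
{
    return (unsigned)((ctrl >> 22) & 0x3u);
}
```

For the control word of Example 7-1 (00C40007h), dma_synch_mode() returns 0 (no synchronization). For the control word of Example 7-2 (00C00040h), it returns 1, the setting that example uses for read synchronization. In both cases dma_start_field() returns 3.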


Example 7-2. DMA Transfer With Communication-Port ICRDY Synchronization 


* 
*  TITLE DMA TRANSFER WITH COMMUNICATION PORT ICRDY 
*        SYNCHRONIZATION 
* 
*  THIS EXAMPLE SETS UP DMA CHANNEL 4 TO TRANSFER DATA FROM THE 
*  COMMUNICATION PORT INPUT REGISTER TO INTERNAL RAM WITH ICRDY 
*  SIGNAL READ SYNCHRONIZATION. THE TRANSFER MODE OF THE DMA IS 
*  SET TO 00. THEREFORE THE TRANSFER WON'T STOP UNTIL THE START 
*  BITS OF THE DMA CONTROL REGISTER ARE CHANGED. 
* 
         .data 
DMA4     .word   001000E0H      ;DMA channel 4 map address 
CONTROL  .word   00C00040H      ;DMA register initialization data 
SOURCE   .word   00100081H 
SRC_IDX  .word   0 
COUNT    .word   0              ;Transfer counter is set to largest value 
DESTIN   .word   002FF800H 
DES_IDX  .word   1 
         .text 
START    LDP     @DMA4          ;Load data page pointer 
         LDA     @DMA4,AR0      ;Point to DMA channel 4 registers 
         LDI     @SOURCE,R0     ;Initialize DMA source register 
         STI     R0,*+AR0(1) 
         LDI     @SRC_IDX,R0    ;Initialize DMA source index register 
         STI     R0,*+AR0(2) 
         LDI     @COUNT,R0      ;Initialize DMA count register 
         STI     R0,*+AR0(3) 
         LDI     @DESTIN,R0     ;Initialize DMA destination register 
         STI     R0,*+AR0(4) 
         LDI     @DES_IDX,R0    ;Initialize DMA destination index register 
         STI     R0,*+AR0(5) 
         LDI     @CONTROL,R0    ;Start DMA channel 4 transfer 
         STI     R0,*AR0 
         LDHI    010H,DIE       ;Enable ICRDY 4 read sync. 
         .end 


If external interrupt signals are used for DMA transfer synchronization, then 
pins IIOF0-3 must be configured as interrupt pins. 


The ’C4x DMA split mode is another way, besides addressing the memory-mapped 
port registers directly, to transfer data to and from a communication port. 
When the split-mode bit of the DMA control register is set, the DMA is 
separated into primary and auxiliary channels. The primary channel transfers 
data from memory to the communication-port output register, and the auxiliary 
channel transfers data from the communication-port input register to memory. 
The communication-port number is selected in bits 15-17 of the DMA control 
register. 
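
As a quick check on a split-mode control word, the port number can be extracted from bits 15-17. The sketch below uses a hypothetical helper name and assumes that the split-mode bit is bit 14 of the control register. That position is not stated in this chapter, although it is consistent with the split-mode control words used in the examples (03CDD0D4h, 0309C091h, and 03CDC0D5h).

```c
#include <stdint.h>

#define DMA_SPLIT_MODE_BIT 14   /* assumption: position not given in the text */

/* Returns the communication-port number (bits 15-17) when the control
   word selects split mode, or -1 when it selects unified mode. */
static int dma_split_comm_port(uint32_t ctrl)
{
    if (((ctrl >> DMA_SPLIT_MODE_BIT) & 1u) == 0)
        return -1;                      /* unified mode: field not used */
    return (int)((ctrl >> 15) & 0x7u);
}
```

Applied to the control word of Example 7-3 (03CDD0D4h), the helper returns 3, which matches the communication port named in that example.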


Example 7-3 shows how to set up DMA channel 1 in split mode. The DMA 
primary channel transfers data from internal RAM to communication port 3 
through external interrupt INT2 synchronization and bit-reversed addressing. 
The DMA auxiliary channel transfers data from communication port 3 to 
internal RAM via external interrupt INT3 synchronization and linear addressing. 


Example 7-3. DMA Split-Mode Transfer With External-Interrupt Synchronization 


* 
*  TITLE DMA SPLIT-MODE TRANSFER WITH EXTERNAL INTERRUPT SYNCHRONIZATION 
* 
*  THIS EXAMPLE SETS UP DMA CHANNEL 1 IN SPLIT MODE. THE PRIMARY CHANNEL TRANSFERS 
*  DATA FROM INTERNAL RAM TO COMM PORT 3 OUTPUT REGISTER WITH EXTERNAL INTERRUPT 
*  INT2 SYNCHRONIZATION AND BIT-REVERSED ADDRESSING. THE AUXILIARY CHANNEL TRANSFERS 
*  DATA FROM COMMUNICATION PORT 3 INPUT REGISTER TO INTERNAL RAM WITH EXTERNAL 
*  INTERRUPT INT3 SYNCHRONIZATION AND LINEAR ADDRESSING. 
* 
         .data 
DMA1     .word   001000B0H      ;DMA channel 1 map address 
CONTROL  .word   03CDD0D4H      ;DMA register initialization data 
SOURCE   .word   002FFC00H 
SRC_IDX  .word   08H            ;The same value as IR0 for bit-reversed 
COUNT    .word   8 
DESTIN   .word   002FF800H 
DES_IDX  .word   1 
AUX_CNT  .word   8 
         .text 
START    LDP     @DMA1          ;Load data page pointer 
         LDA     @DMA1,AR0      ;Point to DMA channel 1 registers 
         LDI     @SOURCE,R0     ;Initialize DMA primary source register 
         STI     R0,*+AR0(1) 
         LDI     @SRC_IDX,R0    ;Initialize DMA primary source index register 
         STI     R0,*+AR0(2) 
         LDI     @COUNT,R0      ;Initialize DMA primary count register 
         STI     R0,*+AR0(3) 
         LDI     @DESTIN,R0     ;Initialize DMA aux destination register 
         STI     R0,*+AR0(4) 
         LDI     @DES_IDX,R0    ;Initialize DMA aux destination index register 
         STI     R0,*+AR0(5) 
         LDI     @AUX_CNT,R0    ;Initialize DMA auxiliary count register 
         STI     R0,*+AR0(7) 
         LDI     @CONTROL,R0    ;Start DMA channel 1 transfer 
         STI     R0,*AR0 
         LDI     01100H,IIF     ;Configure INT2 and INT3 as interrupt pins 
         LDI     0A0H,DIE       ;Enable INT2 read and INT3 write sync. 
         .end 
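
Bit-reversed addressing, as used by the primary channel in Example 7-3, steps the address by adding the index with the carry propagating from the most significant bit toward the least significant bit. The following C model is a sketch of that reverse-carry addition, not a description of the DMA hardware. The function name is hypothetical, and the index is assumed to be a power of two, as it is for an FFT where the index equals half the transform size.

```c
/* Reverse-carry add: adds idx (a power of two) to addr, propagating the
   carry toward bit 0 instead of toward the MSB. Address bits above idx
   are left unchanged. */
static unsigned bitrev_next(unsigned addr, unsigned idx)
{
    unsigned bit = idx;

    /* While the current bit is already set, clear it and carry downward. */
    while (bit != 0 && (addr & bit) != 0) {
        addr &= ~bit;
        bit >>= 1;
    }
    return addr | bit;   /* set the first clear bit (or wrap when bit == 0) */
}
```

Starting from offset 0 with an index of 4 (an 8-point sequence is used here purely for illustration), successive calls visit offsets 0, 4, 2, 6, 1, 5, 3, 7 and then wrap back to 0.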



An advantage of the ’C4x DMA is the autoinitialization feature. This allows you 
to set up the DMA transfer in advance and makes the DMA operation 
completely independent of the CPU. When the DMA operates in autoinitialization 
mode, the link pointer and auxiliary link pointer point to the values that 
initialize the registers controlling the DMA operation. The link pointer can be 
incremented (AUTOINIT STATIC = 0) or held constant (AUTOINIT STATIC = 1) 
during autoinitialization. This option allows autoinitialization values to be 
stored in sequential memory locations or in stream-oriented devices such as 
the on-chip communication ports or external FIFOs. When DMA SYNC MODE 
is enabled, the DMA autoinitialization operation can be configured to 
synchronize with the same signal. Example 7-4 sets up DMA channel 0 to wait 
for the communication port to input the initialization values. After DMA 
autoinitialization is complete, the DMA channel starts transferring data from 
the communication-port input register to internal RAM. 


Example 7-4. DMA Autoinitialization With Communication Port ICRDY 


* 
*  TITLE DMA AUTOINITIALIZATION WITH COMMUNICATION PORT ICRDY 
* 
*  THIS EXAMPLE SETS UP DMA CHANNEL 0 TO WAIT FOR THE COMMUNICATION 
*  PORT TO INPUT THE INITIALIZATION VALUES. THE DMA AUTOINITIAL- 
*  IZATION AND TRANSFER ARE BOTH DRIVEN BY THE ICRDY 0 FLAG. AFTER 
*  DMA AUTOINIT IS COMPLETED, THE DMA CHANNEL STARTS TRANSFERRING 
*  DATA FROM COMM PORT INPUT REGISTER TO INTERNAL RAM WITH ICRDY 
*  0 READ SYNCHRONIZATION. THE VALUES IN COMM PORT 0 INPUT FIFO 
*  SHOULD BE: 
* 
*     SEQUENCE | VALUE 
*     ---------+---------------------------------------------------- 
*         1    | 00C40047H (STOP AFTER TRANSFER COMPLETED) 
*              | OR 00C4054BH (REPEAT AFTER TRANSFER COMPLETED) 
*         2    | 00100041H 
*         3    | 0H 
*         4    | 20H 
*         5    | 002FF800H 
*         6    | 1H 
*         7    | 00100041H 
* 
          .data 
DMA0      .word   001000A0H      ;DMA channel 0 map address 
DMA_INIT  .word   0004054BH      ;DMA initialization control word 
LINK      .word   00100041H      ;Comm port input register address 
DMA_START .word   00C4054BH      ;DMA start control word 
          .text 
START     LDP     @DMA0          ;Load data page pointer 
          LDA     @DMA0,AR0      ;Point to DMA channel 0 registers 
          LDI     @DMA_INIT,R0   ;Initialize DMA control register 
          STI     R0,*AR0 
          LDI     @LINK,R0       ;Initialize DMA link pointer 
          STI     R0,*+AR0(6) 
          LDI     @DMA_START,R0  ;Start DMA channel 0 transfer 
          STI     R0,*AR0 
          LDI     01H,DIE        ;Enable ICRDY 0 read sync. 
          .end 


The DMA autoinitialization and transfer continue executing as long as DMA 
autoinitialization is still enabled. Therefore, a DMA setup like the one in 
Example 7-4 makes it possible for an external device to control the DMA 
operation through the communication port. 


With the autoinitialization feature, the ’C4x DMA coprocessor can support a 
variety of DMA operations without slowing down CPU computation. A good 
example is a DMA transfer triggered by a single interrupt signal. Usually, this 
is implemented by starting a DMA activity from a CPU interrupt service routine, 
but this uses CPU time. However, as shown in Example 7-5, you can set up a 
single interrupt-driven dummy DMA transfer with autoinitialization. When the 
interrupt signal is set, the DMA completes the dummy DMA transfer and starts 
the autoinitialization for the desired DMA transfer. 


Example 7-5. Single-Interrupt-Driven DMA Transfer 


* 
*  TITLE SINGLE INTERRUPT-DRIVEN DMA TRANSFER 
* 
*  THIS EXAMPLE SETS UP A DUMMY DMA TRANSFER FROM INTERNAL RAM 
*  TO THE SAME MEMORY WITH EXTERNAL INT 0 SYNCHRONIZATION AND 
*  AUTOINITIALIZATION FOR TRANSFERRING 64 DATA WORDS FROM LOCAL MEMORY 
*  TO INTERNAL RAM. AFTER THE SECOND TRANSFER IS COMPLETED, THE 
*  DMA IS RE-INITIALIZED TO THE FIRST DMA TRANSFER SETUP. 
* 
          .data 
DMA5      .word   001000F0H      ;DMA channel 5 map address 
DMA_INIT  .word   0000004BH      ;DMA initialization control word 
LINK      .word   DMA1           ;1st DMA link list address 
DMA_START .word   00C0004BH      ;DMA start control word 
DMA1      .word   00C0004BH      ;1st dummy DMA transfer link list 
          .word   002FF800H 
          .word   00000000H 
          .word   00000001H 
          .word   002FF800H 
          .word   00000000H 
          .word   DMA2 
DMA2      .word   00C4000BH      ;The desired DMA transfer link 
          .word   00400000H      ;list 
          .word   00000001H 
          .word   00000040H 
          .word   002FF800H 
          .word   00000001H 
          .word   DMA1 
          .text 
START     LDP     @DMA5          ;Load data page pointer 
          LDA     @DMA5,AR0      ;Point to DMA channel 5 registers 
          LDI     @DMA_INIT,R0   ;Initialize DMA control register 
          STI     R0,*AR0 
          LDI     @LINK,R0       ;Initialize DMA link pointer 
          STI     R0,*+AR0(6) 
          LDI     @DMA_START,R0  ;Start DMA channel 5 transfer 
          STI     R0,*AR0 
          LDI     01H,IIF        ;Configure INT0 as interrupt pin 
          LDHI    0800H,DIE      ;Enable INT 0 read sync. for 
                                 ;DMA channel 5 
          .end 
7.4 DMA C-Programming Examples 


Example 7-6 through Example 7-11 are DMA programming examples in C. 
These examples cover unified and split mode, DMA autoinitialization, and 
DMA synchronization operations. The examples are as follows: 


Example 7-6: Unified-mode DMA transfers data between commports using 
read sync. 


Example 7-7: Unified-mode DMA uses autoinitialization (method 1) to 
transfer 2 data blocks. 


Example 7-8: Unified-mode DMA uses autoinitialization (method 2) to 
transfer 2 data blocks. 


Example 7-9: Split-mode auxiliary DMA transfers data between commports 
using read sync. 


Example 7-10: Split-mode auxiliary and primary channels send/receive 
data to and from a commport. 


Example 7-11: Split-mode DMA autoinitializes both auxiliary and primary 
channels (auxiliary transfers 1 block and primary transfers 2 blocks). 


Example 7-12 is the include file for all examples (dma.h). 


Example 7-6. Unified-Mode DMA Using Read Sync 


/*************************************************************************** 
EXAMPLE: Unified-mode 
         Commport-to-commport transfer: 
         DMA3 in unified mode transfers 8 words from commport 3 to commport 0. 
         DMA3 source sync with ICRDY3 is used. 
Note:    Writes cannot be synchronized with OCRDY0, because a DMA i can 
         only be synchronized with signals coming from commport i. You could 
         sync on ICRDY3 or on OCRDY0, not both (the choice depends on the 
         specific application to avoid deadlock). 
         In this program, DMA3 expects data in commport 3 being sent by 
         another processor/device. Otherwise no transfer will occur. 
***************************************************************************/ 

#include "dma.h" 

#define DMAADDR  0x001000d0 
#define CTRLREG  0x00c40045  /* DMA sends interrupt to CPU when transfer 
                                finishes (TC=1), DMA-CPU rotating priority */ 
#define SRC      0x00100071  /* src = commport 3 input fifo         */ 
#define SRC_IDX  0x0         /* src address does not increment      */ 
#define COUNTER  0x08        /* number of words to transfer         */ 
#define DST      0x00100042  /* dst = commport 0 output fifo        */ 
#define DST_IDX  0x0         /* dst address does not increment      */ 
#define DIEVAL   0x4000      /* set ICRDY3 read sync                */ 

DMAUNIF *dma = (DMAUNIF *)DMAADDR; 
int dieval = DIEVAL; 

main() { 
    dma->src     = (void *)SRC; 
    dma->src_idx = SRC_IDX; 
    dma->counter = COUNTER; 
    dma->dst     = (void *)DST; 
    dma->dst_idx = DST_IDX; 
    dma->ctrl    = (void *)CTRLREG; 

    asm(" ldi @_dieval,die"); 
    PRIM_WAIT_DMA((volatile int *)dma); 
} 
Example 7-7. Unified-Mode DMA Using Autoinitialization (Method 1) 


/*************************************************************************** 
EXAMPLE: Unified Mode 
         Autoinitialization method 1: 
         DMA0 in unified mode transfers 8 words from 0x02ffc00 (index 1) to 
         0x02ffd00 (index 1) and then it transfers 4 words from 0x02ffe00 
         (index 4) to 0x02fff00 (index 1). No DMA sync transfer is used. 
         Autoinitialization method 1 requires N autoinitialization memory 
         blocks to transfer N blocks and starts with a DMA transfer counter 
         equal to 0. 
***************************************************************************/ 

#include "dma.h" 

#define DMAADDR   0x001000a0 

/* 1st transfer settings */ 
#define CTRLREG1  0x00c00009  /* DMA-CPU rotating priority and DMA 
                                 autoinitializes when transfer counter = 0 */ 
#define SRC1      0x002ffc00  /* src address                    */ 
#define SRC1_IDX  0x1         /* src address increment          */ 
#define COUNTER1  0x08        /* number of words to transfer    */ 
#define DST1      0x002ffd00  /* dst address                    */ 
#define DST1_IDX  0x1         /* dst address increment          */ 

/* 2nd transfer settings */ 
#define CTRLREG2  0x00c40005  /* DMA sends interrupt to CPU when transfer 
                                 finishes (TC=1), DMA-CPU rotating priority 
                                 and DMA stops after transfer completes */ 
#define SRC2      0x002ffe00  /* src address                    */ 
#define SRC2_IDX  0x4         /* src address increment          */ 
#define COUNTER2  0x4         /* number of words to transfer    */ 
#define DST2      0x002fff00  /* dst address                    */ 
#define DST2_IDX  0x1         /* dst address increment          */ 

DMAUNIF *dma = (DMAUNIF *)DMAADDR; 
DMAUNIF autoini1; 
DMAUNIF autoini2; 

main() { 
    /* initialize 1st set of autoinitialization values */ 
    autoini1.src     = (void *)SRC1; 
    autoini1.src_idx = SRC1_IDX; 
    autoini1.counter = COUNTER1; 
    autoini1.dst     = (void *)DST1; 
    autoini1.dst_idx = DST1_IDX; 
    autoini1.linkp   = &autoini2; 
    autoini1.ctrl    = (void *)CTRLREG1; 

    /* initialize 2nd set of autoinitialization values */ 
    autoini2.src     = (void *)SRC2; 
    autoini2.src_idx = SRC2_IDX; 
    autoini2.counter = COUNTER2; 
    autoini2.dst     = (void *)DST2; 
    autoini2.dst_idx = DST2_IDX; 
    autoini2.ctrl    = (void *)CTRLREG2; 

    /* initialize DMA (link pointer pointing to 1st set of autoinit. values) */ 
    dma->linkp   = &autoini1; 
    dma->counter = 0; 
    dma->ctrl    = (volatile void *)CTRLREG1; 

    /* wait for DMA to finish transfer */ 
    PRIM_WAIT_DMA((volatile int *)dma); 
} 
Example 7-8. Unified-Mode DMA Using Autoinitialization (Method 2) 


/*************************************************************************** 
EXAMPLE: Unified Mode 
         Autoinitialization method 2: 
         DMA0 in unified mode transfers 8 words from 0x02ffc00 (index 1) 
         to 0x02ffd00 (index 1) and then it transfers 4 words from 0x02ffe00 
         (index 4) to 0x02fff00 (index 1). No DMA sync transfer is used. 
         Autoinitialization method 2 requires (N-1) autoinitialization memory 
         blocks to transfer N blocks and starts with a DMA transfer counter 
         different from 0. 
***************************************************************************/ 

#include "dma.h" 

#define DMAADDR   0x001000a0 

/* 1st transfer settings */ 
#define CTRLREG1  0x00c00009  /* DMA-CPU rotating priority and DMA 
                                 autoinitializes when transfer counter = 0 */ 
#define SRC1      0x002ffc00  /* src address                    */ 
#define SRC1_IDX  0x1         /* src address increment          */ 
#define COUNTER1  0x08        /* number of words to transfer    */ 
#define DST1      0x002ffd00  /* dst address                    */ 
#define DST1_IDX  0x1         /* dst address increment          */ 

/* 2nd transfer settings */ 
#define CTRLREG2  0x00c40005  /* DMA sends interrupt to CPU when transfer 
                                 finishes (TC=1), DMA-CPU rotating priority 
                                 and DMA stops after transfer completes */ 
#define SRC2      0x002ffe00  /* src address                    */ 
#define SRC2_IDX  0x4         /* src address increment          */ 
#define COUNTER2  0x4         /* number of words to transfer    */ 
#define DST2      0x002fff00  /* dst address                    */ 
#define DST2_IDX  0x1         /* dst address increment          */ 

DMAUNIF *dma = (DMAUNIF *)DMAADDR; 
DMAUNIF autoini2; 

main() { 
    /* initialize 2nd set of autoinitialization values */ 
    autoini2.src     = (void *)SRC2; 
    autoini2.src_idx = SRC2_IDX; 
    autoini2.counter = COUNTER2; 
    autoini2.dst     = (void *)DST2; 
    autoini2.dst_idx = DST2_IDX; 
    autoini2.ctrl    = (void *)CTRLREG2; 

    /* initialize DMA with 1st set of autoinitialization values */ 
    dma->src     = (void *)SRC1; 
    dma->src_idx = SRC1_IDX; 
    dma->counter = COUNTER1; 
    dma->dst     = (void *)DST1; 
    dma->dst_idx = DST1_IDX; 
    dma->linkp   = &autoini2; 
    dma->ctrl    = (void *)CTRLREG1; 

    /* wait for DMA to finish transfer */ 
    PRIM_WAIT_DMA((volatile int *)dma); 
} 
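
The difference between autoinitialization methods 1 and 2 can be explored on a host with a small software model. The sketch below is only an illustration under stated assumptions. The structure and function names are hypothetical, a plain autoinit flag stands in for the transfer-mode bits of the control register, and synchronization, priorities, and interrupts are not modeled.

```c
#include <stddef.h>

/* Host-side model of one set of DMA register values. The field names
   follow DMAUNIF in dma.h, but this is not the hardware layout. */
struct dma_block {
    int autoinit;                    /* model flag: reload from linkp when done */
    const int *src;
    int src_idx;
    int counter;
    int *dst;
    int dst_idx;
    const struct dma_block *linkp;
};

/* Runs the model starting from the given register contents and returns
   the number of blocks transferred. */
static int dma_model_run(struct dma_block regs)
{
    int blocks = 0;

    for (;;) {
        if (regs.counter > 0) {
            while (regs.counter > 0) {
                *regs.dst = *regs.src;
                regs.counter--;
                if (regs.counter > 0) {      /* step only while words remain */
                    regs.src += regs.src_idx;
                    regs.dst += regs.dst_idx;
                }
            }
            blocks++;
        }
        if (!regs.autoinit || regs.linkp == NULL)
            break;                           /* transfer mode: stop */
        regs = *regs.linkp;                  /* autoinitialization: reload registers */
    }
    return blocks;
}
```

With method 1, the starting registers hold a zero counter and a link pointer to the first of N memory blocks. With method 2, the starting registers already hold the first block's values, so only N-1 memory blocks are needed. In both cases the model reports the same number of transferred blocks.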
Example 7-9. Split-Mode Auxiliary DMA Using Read Sync 


/*************************************************************************** 
EXAMPLE: Split-mode (AUX only) 
         Commport-to-commport transfer: 
         DMA3 auxiliary channel transfers 8 words from commport 3 to 
         commport 0. DMA3 source sync with ICRDY3 is used. 
         This example is functionally equivalent to Example 7-6. 
         In this program, DMA3 expects data in commport 3 being sent by 
         another processor/device. Otherwise no transfer will occur. 
***************************************************************************/ 

#include "dma.h" 

#define DMAADDR   0x001000d0 
#define CTRLREG   0x0309c091  /* DMA aux sends interrupt to CPU when 
                                 transfer finishes (TC=1), DMA-CPU rotating 
                                 priority */ 
#define DST       0x00100042  /* dst = commport 0 output fifo       */ 
#define DST_IDX   0x0         /* dst address does not increment     */ 
#define DIEVAL    0x4000      /* set ICRDY3 auxiliary read sync     */ 
#define ACOUNTER  0x08        /* auxiliary channel counter          */ 

DMASPLIT *dma = (DMASPLIT *)DMAADDR; 
int dieval = DIEVAL; 

main() { 
    dma->dst      = (void *)DST; 
    dma->dst_idx  = DST_IDX; 
    dma->acounter = ACOUNTER; 
    dma->ctrl     = (void *)CTRLREG; 

    asm(" ldi @_dieval,die"); 
    AUX_WAIT_DMA((volatile int *)dma); 
} 
Example 7-10. Split-Mode Auxiliary and Primary Channel DMA 


/*************************************************************************** 
EXAMPLE: Split-mode (AUX and PRIMARY both running) 
         Commport-to-commport transfer: 
         DMA3 prim. channel sends 4 words from memory (0x02ffc00) to 
         commport 3 (output FIFO). 
         DMA3 aux. channel receives 8 words from commport 3 (input FIFO) 
         to memory (0x02ffd00). 
         DMA3 prim. channel uses OCRDY3 write sync. 
         DMA3 aux. channel uses ICRDY3 read sync. 
         In this program, DMA3 aux channel expects data in commport 3 being 
         sent by another processor/device. Otherwise no aux channel transfer 
         will occur. 
***************************************************************************/ 

#include "dma.h" 

#define DMAADDR   0x001000d0 
#define CTRLREG   0x03cdc0d5  /* DMA aux/prim send interrupt to CPU when 
                                 transfer finishes (TC=1), DMA-CPU rotating 
                                 priority, read/write sync transfer */ 
#define DIEVAL    0x24000     /* set ICRDY3/OCRDY3 read/write sync  */ 
#define DST       0x02ffd00   /* auxiliary channel settings         */ 
#define DST_IDX   0x1 
#define ACOUNTER  0x08 
#define SRC       0x02ffc00   /* primary channel settings           */ 
#define SRC_IDX   0x1 
#define COUNTER   0x04 

DMASPLIT *dma = (DMASPLIT *)DMAADDR; 
int dieval = DIEVAL; 

main() { 
    dma->src      = (void *)SRC;      /* primary channel   */ 
    dma->src_idx  = SRC_IDX; 
    dma->counter  = COUNTER; 
    dma->dst      = (void *)DST;      /* auxiliary channel */ 
    dma->dst_idx  = DST_IDX; 
    dma->acounter = ACOUNTER; 
    dma->ctrl     = (void *)CTRLREG; 

    asm(" ldi @_dieval,die"); 
    SPLIT_WAIT_DMA((volatile int *)dma); 
} 
Example 7-11. Split-Mode DMA Using Autoinitialization 


/*************************************************************************** 
EXAMPLE: Split-mode (AUX and PRIMARY both running) 
         Autoinitialization example: 
         DMA3 aux. channel autoinitializes and THEN receives 4 words from 
         commport 3 (input FIFO) to memory (0x02ffd00). 
         DMA3 prim. channel sends 4 words from memory (0x02ffc00) to 
         commport 3 (output FIFO) and THEN another 2 words from memory 
         (0x02ffc10) with index=2 to commport 3 (output FIFO). 
         DMA3 prim. channel uses OCRDY3 write sync. 
         DMA3 aux. channel uses ICRDY3 read sync. 
         Autoinitialization method 1 is used in all cases. 
         In this program, DMA3 aux channel expects data in commport 3 being 
         sent by another processor/device. Otherwise no aux channel transfer 
         will occur. 
***************************************************************************/ 

#include "dma.h" 

#define DMAADDR    0x001000d0 
#define CTRLREG1   0x03cdc0e9  /* DMA aux/prim send interrupt to CPU when 
                                  transfer finishes (TC=1), DMA-CPU rotating 
                                  priority, read/write sync transfer */ 
#define CTRLREG2   0x03cdc0d5  /* same as above, but DMA stops when 
                                  transfer finishes */ 
#define DIEVAL     0x24000     /* set ICRDY3/OCRDY3 read/write sync */ 

/* Primary channel */ 
#define SRC1       0x02ffc00   /* autoinitialization 1 */ 
#define SRC1_IDX   0x1 
#define COUNTER1   0x04 
#define SRC2       0x02ffc10   /* autoinitialization 2 */ 
#define SRC2_IDX   0x2 
#define COUNTER2   0x02 

/* Auxiliary channel */ 
#define DST1       0x02ffd00   /* autoinitialization 1 */ 
#define DST1_IDX   0x1 
#define ACOUNTER1  0x04 

DMASPLIT *dma = (DMASPLIT *)DMAADDR; 
int dieval = DIEVAL; 

DMAPRIM autoini1, autoini2; 
DMAAUX  autoiniaux; 

main() { 
    /* PRIMARY CHANNEL: 1st autoinitialization values */ 
    autoini1.ctrl    = (void *)CTRLREG1; 
    autoini1.src     = (void *)SRC1; 
    autoini1.src_idx = SRC1_IDX; 
    autoini1.counter = COUNTER1; 
    autoini1.linkp   = &autoini2; 

    /* PRIMARY CHANNEL: 2nd autoinitialization values */ 
    autoini2.ctrl    = (void *)CTRLREG2; 
    autoini2.src     = (void *)SRC2; 
    autoini2.src_idx = SRC2_IDX; 
    autoini2.counter = COUNTER2; 

    /* AUXILIARY CHANNEL: 1st autoinitialization values */ 
    autoiniaux.ctrl     = (void *)CTRLREG2; 
    autoiniaux.dst      = (void *)DST1; 
    autoiniaux.dst_idx  = DST1_IDX; 
    autoiniaux.acounter = ACOUNTER1; 

    /* initialize DMA */ 
    dma->linkp    = &autoini1; 
    dma->alinkp   = &autoiniaux; 
    dma->counter  = 0; 
    dma->acounter = 0; 
    dma->ctrl     = (void *)CTRLREG1; 

    asm(" ldi @_dieval,die"); 

    /* wait for DMA to finish transfer */ 
    SPLIT_WAIT_DMA((volatile int *)dma); 
} 
Example 7-12. Include File for All C Examples (dma.h) 


typedef struct dmaunif { 
    volatile void  *ctrl;       /* control register       */ 
    volatile void  *src;        /* source address         */ 
    volatile int    src_idx;    /* source address index   */ 
    volatile int    counter;    /* transfer counter       */ 
    volatile void  *dst;        /* dest. address          */ 
    volatile int    dst_idx;    /* dest. address index    */ 
    struct dmaunif *linkp;      /* link pointer           */ 
} DMAUNIF; 

typedef struct dmaprim { 
    volatile void  *ctrl;       /* control register       */ 
    volatile void  *src;        /* prim. src address      */ 
    volatile int    src_idx;    /* prim. index            */ 
    volatile int    counter;    /* prim. transfer counter */ 
    struct dmaprim *linkp;      /* link pointer           */ 
} DMAPRIM; 

typedef struct dmaaux { 
    volatile void  *ctrl;       /* control register       */ 
    volatile void  *dst;        /* aux. dst address       */ 
    volatile int    dst_idx;    /* aux. index             */ 
    volatile int    acounter;   /* aux. transfer counter  */ 
    struct dmaaux  *alinkp;     /* aux. link pointer      */ 
} DMAAUX; 

typedef struct dmasplit { 
    volatile void  *ctrl;       /* control register       */ 
    volatile void  *src;        /* prim. src address      */ 
    volatile int    src_idx;    /* prim. index            */ 
    volatile int    counter;    /* prim. transfer counter */ 
    volatile void  *dst;        /* aux. dst address       */ 
    volatile int    dst_idx;    /* aux. index             */ 
    struct dmaprim *linkp;      /* link pointer           */ 
    volatile int    acounter;   /* aux. transfer counter  */ 
    struct dmaaux  *alinkp;     /* aux. link pointer      */ 
} DMASPLIT; 

#define PRIM_WAIT_DMA(x)   while ((0x00c00000 & *x) != 0x00800000) 
#define AUX_WAIT_DMA(x)    while ((0x03000000 & *x) != 0x02000000) 
#define SPLIT_WAIT_DMA(x)  while ((0x03c00000 & *x) != 0x02800000) 
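
The structures in dma.h work as register overlays because every member occupies one 32-bit word on the ’C4x, so member order equals register offset. The word offsets used by the assembly examples (*+AR0(1) for the source address through *+AR0(7) for the auxiliary counter) can be cross-checked on a host with the sketch below. The structure name and the macro are hypothetical, and fixed-width integers are used because host pointers are not necessarily 32 bits wide.

```c
#include <stddef.h>
#include <stdint.h>

/* Host-side mirror of the split-mode register block: one 32-bit word
   per register, in the same order as DMASPLIT in dma.h. */
struct dma_regs_model {
    uint32_t ctrl;
    uint32_t src;
    uint32_t src_idx;
    uint32_t counter;
    uint32_t dst;
    uint32_t dst_idx;
    uint32_t linkp;
    uint32_t acounter;
    uint32_t alinkp;
};

/* Word offset of a register from the channel's base address. */
#define DMA_WORD_OFFSET(field) \
    (offsetof(struct dma_regs_model, field) / sizeof(uint32_t))
```

For example, DMA_WORD_OFFSET(linkp) evaluates to 6, which matches the STI R0,*+AR0(6) instruction that loads the link pointer in Example 7-4.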
Using the Communication Ports 


The ’C4x communication ports are very high-speed data transmission circuits. 
Their speed and the close proximity of multiple data lines create special 
challenges. General design rules that are applicable to high-speed (<10 ns) 
memory interface design are appropriate for ’C4x communication-port 
interconnections. This chapter provides guidelines for designing 
communication-port interfaces. 
8.1 Communication Ports 


To provide simple processor-to-processor communication, the ’C4x has six 
parallel bidirectional communication ports. Because these ports have port 
arbitration units to handle the ownership of the communication-port data bus 
between the processors, you need to concentrate only on the internal operation 
of the communication ports. For software, these communication ports can be 
treated as 32-bit on-chip data I/O FIFO buffers. Reading data from or writing 
data to a communication port is simple: 


      LDI   @comm_port0_input,R0    ;Read data from comm. port 0 
or 
      STI   R0,@comm_port0_output   ;Write data to comm. port 0 


If the CPU or DMA reads from or writes to the communication-port I/O FIFO 
and the I/O-FIFO is either empty (on a read) or full (on a write), the read/write 
execution will be extended either until the data is available in the input FIFO 
for a read, or until the space is available in the output FIFO for a write. Some- 
times, you can use this feature to synchronize the devices. However, this can 
slow down the processing speed and even hang up the processor. Avoid such 
situations by synchronizing the CPU/DMA accesses with the following flags 
that indicate the status of the port: 


ICRDY (input channel ready) 
= 0, the input channel is empty and not ready to be read. 
= 1, the input channel contains data and is ready to read. 


ICFULL (input channel full) 
= 0, the input channel is not full. 
= 1, the input channel is full. 


OCRDY (output channel ready) 
= 0, the output channel is full and not ready to be written. 
= 1, the output channel is not full and ready to be written. 


OCEMPTY (output channel empty) 
= 0, the output channel is not empty. 
= 1, the output channel is empty. 
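
The four flags can be summarized with a small host-side model of a FIFO fill level. This is an illustrative sketch only. All names are hypothetical, a depth of eight words is assumed (Example 8-1 reads eight words when ICFULL is raised), and the model does not access the communication-port control register.

```c
#define FIFO_DEPTH 8    /* assumed depth: eight 32-bit words */

/* Model of one FIFO: only the number of words currently held. */
struct fifo_model {
    int level;
};

/* Flag definitions as listed above. */
static int icrdy(const struct fifo_model *in)    { return in->level > 0; }
static int icfull(const struct fifo_model *in)   { return in->level == FIFO_DEPTH; }
static int ocrdy(const struct fifo_model *out)   { return out->level < FIFO_DEPTH; }
static int ocempty(const struct fifo_model *out) { return out->level == 0; }

/* Guarded accesses: return 0 instead of stalling when the FIFO is not
   ready, which is the situation the flags are meant to avoid. */
static int try_read(struct fifo_model *in)
{
    if (!icrdy(in))
        return 0;
    in->level--;
    return 1;
}

static int try_write(struct fifo_model *out)
{
    if (!ocrdy(out))
        return 0;
    out->level++;
    return 1;
}
```

Checking the ready flag before each access, as try_read() and try_write() do, keeps the CPU or DMA from being held on the peripheral bus while it waits for data or for space.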


Example 8-1 shows the reading of data from the communication port, eight 
words at a time, using the CPU ICFULL interrupt. Example 8-2 shows the writing 
of data to a communication port, one word at a time, using the polling method. 
Both examples show DMA reads/writes. (DMA is discussed in Section 7.3, 
DMA Assembly Programming Examples.) 
Example 8-1. Read Data From Communication Port With CPU ICFULL Interrupt 


* 
*  TITLE READ DATA FROM COMMUNICATION PORT WITH CPU 
*        ICFULL INTERRUPT 
* 
*  THIS EXAMPLE ASSUMES THE ICFULL 0 INTERRUPT VECTOR IS SET IN THE CPU 
*  INTERRUPT VECTOR TABLE. THE EIGHT DATA WORDS ARE READ IN 
*  WHENEVER THE DATA IS FULL IN COMM PORT 0 INPUT FIFO. 
* 
         LDA     @COMM_PORT0_CTL,AR2    ;Load comm port 0 control reg. address 
         LDA     @COMM_PORT0_INPUT,AR0  ;Load comm port 0 input FIFO address 
         LDA     @INTERNAL_RAM,AR1      ;Load internal RAM address 
         AND3    0F7H,*AR2,R9           ;Unhalt comm port 0 input channel 
         STI     R9,*AR2 
         OR      04H,IIE                ;Enable ICRDY 0 interrupt 
         OR      02000H,ST              ;Enable CPU global interrupt 
ICFULL0  PUSH    ST 
         PUSH    RS 
         PUSH    RE 
         PUSH    RC 
         LDI     *AR0,R10               ;Read data from comm port 0 input 
         RPTS    6                      ;Setup for loop READ 
READ     LDI     *AR0,R10               ;Read data from comm port 0 input 
||       STI     R10,*AR1++(1)          ;Store data into internal RAM 
         STI     R10,*AR1++(1)          ;Store data into internal RAM 
         POP     RC 
         POP     RE 
         POP     RS 
         POP     ST 
         RETI 
Example 8-2. Write Data to Communication Port With Polling Method 


* 
*  TITLE WRITE DATA TO COMMUNICATION PORT WITH POLLING METHOD 
* 
*  BIT 8 OF THE COMMUNICATION PORT 0 CONTROL REGISTER WILL BE 
*  SET ONLY WHEN THE OUTPUT FIFO IS FULL. THIS EXAMPLE CHECKS 
*  THIS BIT TO MAKE SURE THERE IS SPACE AVAILABLE IN THE 
*  OUTPUT FIFO. 
* 
            LDA     @COMM_PORT0_CTL,AR2     ;Load comm port 0 control reg address 
            LDA     @COMM_PORT0_OUTPUT,AR0  ;Load comm port 0 output FIFO address 
            LDA     @INTERNAL_RAM,AR1       ;Load internal RAM address 
            AND3    0EFH,*AR2,R9            ;Unhalt comm port 0 output channel 
            STI     R9,*AR2 
            LDI     0100H,R9                ;Load mask for bit 8 
WAIT:       TSTB    *AR2,R9                 ;Check if output FIFO is full 
            BZD     WAIT                    ;If yes, check again 
WRITE_COMM  LDI     *AR1++(1),R10           ;Read data from internal RAM 
            STI     R10,*AR0                ;Store data into comm port 0 output 
            NOP 
8.2 Signal Considerations 


Because of the bidirectional high-speed protocol used in the ’C4x 
communication ports, signal quality is extremely important. Poor-quality 
signals can potentially cause both ends of a communication-port link to become 
a master. If this occurs and one communication port drives a signal request, 
no response is received from the other communication port, and the link hangs. 
This condition remains until both ’C4x devices are reset. If this is not 
corrected, the communication-port drivers can be damaged. 


If poor-quality signals are a problem, use circuits to improve impedance 
matching. Because the ’C4x communication-port output buffer impedance can 
change during signal switching, a conventional parallel termination does not 
help. Serial matching resistors can be added at each end of all 
communication-port lines (see Figure 8-1). Serial resistors help match the 
output buffer impedance to the line impedance and protect against signal 
contention caused by any potential fault condition. The resistor value, plus 
buffer output impedance, should match the line impedance. Results have shown 
that a lower than optimal serial resistor value provides better performance. A 
resistor value of 22-33 Ω is usually a reasonable start. Some experimentation 
may be needed to reduce ringing effects. A good received signal should have an 
undershoot of 0.5 to 1.0 V or less. A resistor value that is too high results 
in an underdamped falling edge that does not cross the zero logic level and 
should be avoided. 


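Figure 8-1. Impedance Matching for ’C4x Communication-Port Design

[Figure not reproduced. Recoverable labels: pin as an output and pin as an
input, each with a 10-kΩ pullup (Rp) to VCC; a series resistor (Rs) at each
end, marked as lower than optimal value; line impedance Z0 = 50-100 Ω.]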


Even though pullup resistors do not help for impedance matching, they are 
recommended at each end to avoid unintended triggering after reset, when 
RESET going low is not received on all ’C4x devices at the same time. 


A pulldown resistor is not desirable, because it increases power consumption, 
does not protect the device from a fault condition, and can cause token loss 
and byte slippage on reset. 
For jumps to other boards or for long distances, a unidirectional data flow with
buffering is the preferred method. In this case, use buffers with hysteresis for 
CSTRB and CRDY at each end with delays greater than those in the data bus. 
This has two advantages: it cleans up the signals and helps eliminate glitches 
that can be erroneously perceived as valid control; it also allows the data bits 
to settle before the receiver sees CSTRB going low. 


8.3 Interfacing With a Non-’C4x Device 


To guarantee a correct word transfer operation between a ’C4x communica- 
tion port and a non-’C4x device, the non-’C4x device should mimic the hand- 
shaking operation between CSTRB and CRDY (word transfer), CREQ and 
CACK (token transfer). The token transfer operation is more complex than the 
word transfer operation. It requires tri-stating of pins after different events. 
Sections 8.6 and 8.7 offer examples on how to handle token transfers with 
non-’C4x devices. The word transfer operation is much simpler. The following 
sequence describes the word transfer operation: 


Word transfer operation


CASE I: The non-’C4x has the token and transmits data. The ’C4x receives data.


1) The non-’C4x device drives the first byte (byte 0) onto the CD data lines
   and then drops CSTRB low, indicating new data. There is no need to meet
   the maximum timing requirements, but the data should be valid before
   CSTRB goes low.

2) The non-’C4x device waits for the ’C4x to respond with CRDY low and then
   can immediately drive the next data byte and bring CSTRB high.

3) The non-’C4x device waits for CRDY to be high; then, steps 1, 2, and 3
   repeat for bytes 1-3.

4) After byte 3 is transmitted, the non-’C4x device can leave the byte 3
   value on the CD lines until a new word is sent.

5) In ’C4x device revisions lower than 3.0, CSTRB should go high after
   receiving CRDY low no later than one ’C4x H1/H3 cycle between word
   boundaries. See Section 8.9, Implementing a CSTRB Shortener Circuit, for
   an implementation of a CSTRB shortener circuit. In ’C4x device revisions
   3.0 or higher, no CSTRB width restriction exists.

6) The non-’C4x device can drive CSTRB low for the next word at any time
   after receiving CRDY high from the last byte. There is no reason to wait
   for the internal ’C4x synchronizer between CRDY low and CSTRB low for
   the next word to finish.


CASE II: The ’C4x has the token and transmits data. The non-’C4x device
receives data.


1) After receiving CSTRB low from the ’C4x, indicating new data valid, the
   non-’C4x device can immediately read the data byte and then drive CRDY
   low, indicating that the byte has been read. There is no maximum time
   limit between these two events.

2) The non-’C4x device then waits to receive CSTRB high and can immediately
   drive CRDY high, ending the byte transfer operation.
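

The word transfer handshake described in this section can be traced with a small host-side model. The Python sketch below is a simplified illustration, not driver code: it ignores timing and token ownership, and the function name and the `byte0_is_lsb` parameter are assumptions made for this example. In particular, treating byte 0 as the least significant byte is an assumption and should be checked against the device documentation.

```python
def transfer_word(word, byte0_is_lsb=True):
    """Model one 32-bit word crossing the link as four byte handshakes.

    Returns the word seen by the receiver and the ordered signal events.
    """
    events = []
    received = 0
    for index in range(4):
        shift = 8 * index if byte0_is_lsb else 8 * (3 - index)
        byte = (word >> shift) & 0xFF
        events.append(f"sender: byte {index} valid on CD7-0")
        events.append("sender: CSTRB low")
        received |= byte << shift  # receiver reads the byte
        events.append("receiver: CRDY low")
        events.append("sender: CSTRB high")
        events.append("receiver: CRDY high")
    return received, events


word, events = transfer_word(0x12345678)
for line in events[:5]:
    print(line)
```

Each byte produces the same five events, so a full word produces twenty, and the data is always valid before CSTRB goes low.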
8.4 Terminating Unused Communication Ports


To avoid unintended communication port triggering, you can terminate unused
communication-port control lines in one of the following ways: 


- Use pullup resistors in all the communication-port control lines. Pullups in
  data lines of input communication ports are optional, but they lower power
  consumption. Pullups in data lines of output communication ports are not
  required; if used, they increase power consumption.

- Tie the control lines together on the same communication port, that is,
  CSTRB to CRDY and CREQ to CACK. This holds the control inputs high
  without using external pullup resistors.


8.5 Design Tips 



- Be careful with different voltage levels when running multiple ’C4x devices
  (or any other CMOS device) from different power supplies. This can create
  a CMOS latch-up that can permanently damage your device. Adding serial
  resistors to ’C4x communication ports connecting devices on different
  boards helps marginally to protect communication-port drivers. It is
  recommended that all ’C4x devices in the system remain in reset until power
  supplies are stable.

- Sometimes, it is beneficial to keep the line impedance as high as possible.
  This helps when interfacing to external cables. Typical ribbon cable
  impedance is about 100 Ω.

  Because it is sometimes difficult to route high-impedance lines (especially
  long ones) in a circuit board, use an external ribbon cable to jump over the
  length of a board. In this case, only two headers should be installed in the
  circuit board.

- Use an alternating signal and ground scheme. This helps control differential
  signal coupling and impedance variation. For quality signals, use a
  26-wire ribbon ((4 control + 8 data + 1 shield) * 2 = 26). The shield is
  needed for the signal that is otherwise on the edge.

- Do not route signals on top of each other. When it is necessary to cross
  traces on adjacent layers, cross them at right angles to reduce coupling.


Note:

Because the ’C4x communication ports are very high-speed data transmission
circuits, signal quality is very important. A poor quality signal can cause
a byte to be missed or to slip. If this happens, the only solution is a ’C4x
reset. Because at reset communication ports 0, 1, and 2 are transmitters and
3, 4, and 5 are receivers, a safe reset requires resetting every ’C4x
connected to the ’C4x with the faulty condition. Global reset becomes a
necessity.
8.6 Commport to Host Interface


A host interface between a ’C4x commport and a PC’s bidirectional printer port
has many advantages, including freeing up the DSP bus and treating the host
PC as a virtual ’C4x node within a system of ’C4x devices.


This interface uses a bidirectional PC printer port interface. Logic circuits,
buffers, and resistors convert logic control levels driven from the printer
port into ’C4x commport control signals. Signals driven from the ’C4x are
converted into status signals, which can be polled in software by the PC. In
addition, the PC’s printer port provides the byte-wide data path into and out
of the PC.


You can use this I/O interface for host-data communication, bootloading, and
debug operations. With proper buffering and software control, it is also
possible to build long and reliable links. The speed is primarily dependent on
the speed of the host. When using a PC as the host, the speed is limited by the
PC’s I/O channel speed. If higher rates are needed, use a memory-mapped
version of the printer port in the PC.


The printer port used to test this circuit was the DSP-550 from STB Systems,
but there are other bidirectional printer ports on the market. Using the STB
card in the bidirectional mode requires that a jumper be set (see your
manual). Then, if a 1 is written to bit 5 or bit 7 of the control register
(which bit depends on your printer port), data can be read back from the data
register.


8.6.1 Simplified Hardware Interface for ’C40 PG ≥ 3.3, or ’C44 Devices


Figure 8—2 shows a simplified commport signal splitter that splits each comm- 
port control signal into a simple drive and sense pair of signals. Simplified, in 
this case, means that, though the circuit is easy to follow functionally and will 
operate, it is not the preferred solution (see the improved driver in Figure 8—3). 
The signals in this circuit can be easily buffered without risk of driver conflicts. 
However, keep a few things in mind about the simplified design: 


- Due to commport-control signal restrictions in earlier silicon revisions,
  this circuit will not work with the TMS320C40 PG 3.0 or lower.

- This circuit requires a bidirectional printer port.

- Standard printer-port cables often do not provide clean signals.

- A high value is needed for the isolation resistor in order to keep the
  current levels during signal opposition to a minimum. However, a low value
  is needed for the isolation resistor in order to ensure reasonably fast
  rise and fall times of the commport control signals when they are inputs.
  This conflict can be overcome by carefully picking the resistor values or
  by adding additional biasing.


Figure 8-2. Better Commport Signal Splitter 

[Figure not reproduced. Legend: Rp = 470 ohms, Rs = 47 ohms; two further
resistors of 180 ohms and 220 ohms whose designators are not legible.]


8.6.2 Improved Drive and Sense Amplifiers


Two improvements are suggested for the interface described above. The 
improvements are described in Figure 8-3. 


Figure 8-3. Improved Interface Circuit 


[Figure not reproduced. Recoverable labels: parallel port, clock, drive and
sense paths, RS-232 driver, ’C4x comm port. Legend: Rp = 470 ohms,
R1 = 1 K ohms, C1 = 100 pF; the designators of the 10-K ohm and 50-ohm
resistors are not legible.]


The first improvement is that the signals going to and from the printer port
are synchronized using a clock and a simple data latch. Because the signal is
sampled at discrete times, noise that corrupts the first sample of a
transition will probably not also corrupt the next sample. Adding a
hysteresis loop made from resistors R1 and R2 improves the noise immunity
further. Capacitor C1 is an additional analog filter that rejects
high-frequency noise.


The next major improvement is the use of a current driver in place of the
isolation resistor. In this case, an RS-232 driver is used; this driver can
drive beyond the supply rails of the DSP and has a built-in current limit of
about 20 mA.
Diodes D1 and D2, along with R3, clamp the resulting signal to the supply rails 
of the DSP and latch to prevent excessive overdrive. The DSP and latch both 
have internal clamping diodes, but it is not recommended that you rely on them 
as the internal clamp diodes are not intended for this purpose. 
8.6.3 How the Circuit Works


The PC can drive any value on the control lines, independent of the returned
status. If a logic 1 is driven into the drive side of the isolation resistor
and a logic 0 is observed on the sense side, the ’C4x commport signal in
question is without a doubt an output.
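

The drive-and-sense rule above can be expressed as a short host-side check. The Python sketch below is illustrative only; the function names and the two callables that stand in for printer-port access are assumptions made for this example, not part of the interface software described later in this section.

```python
def commport_signal_is_output(drive_level, sense_level):
    """Rule from the text: driving 1 and sensing 0 means the 'C4x pin is an output."""
    return drive_level == 1 and sense_level == 0


def probe_is_output(write_drive, read_sense):
    """Drive a logic 1 through the isolation resistor and test the sense side.

    write_drive and read_sense are hypothetical callables that access the
    printer-port control and status bits for one commport control signal.
    """
    write_drive(1)
    return commport_signal_is_output(1, read_sense())
```

Only the case stated in the text (drive 1, sense 0) is treated as conclusive; other combinations are left undecided in this sketch.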


By then driving levels and polling the returned status, it is possible to synchro- 
nize a host processor to the state machine of the ’C4x commport. The advan- 
tage of this design is that it can be easily ported to any smart processor with 
any basic I/O capability. For example, TMS320C31/32 devices have been 
used as slave devices that are bootloaded from a commport and then used as 
serial ports with internal memory and additional processing capabilities. Com- 
plicated and risky ASIC designs are not required and the solution is fully pro- 
grammable. 


You must include current limiting circuitry when designing any ’C4x
interface. If the current is not limited, it can exceed 100 mA per pin,
which can damage a device.


8.6.4 The Interface Software 


The interface software for this host interface is available through the TI BBS 
(filename: M4x_2.exe). This file contains not only the low-level software driv- 
ers, but also extra code for the M4x (a multiprocessor ’C4x communication ker- 
nel) applications note. The following files are contained in this application: 


M4X           Debugger (no source code)
MEMVIEW       Memory and communications matrix view and edit utility
MANDEL40      Multiprocessor Mandelbrot demonstration program
M4X.ASM       Multiprocessor TMS320C4x communications kernel
DRIVER.CPP    Higher level system functions
TARGET.CPP    getmem, putmem, run, stop, and singlestep commands
OBJECT.CPP    Source code for using the printer port interface
8.7 An I/O Coprocessor-’C4x Interface


This section presents a software-based interface that provides a ’C4x with a 
flexible bidirectional interface to a TMS320C32. The ’C32 acts as a smart I/O 
coprocessor that can provide AIC interfacing and data preprocessing among 
others. The ’C32 is an inexpensive and flexible solution. 


Some of the advantages of using an I/O coprocessor include:

- An I/O coprocessor can provide data processing.

- An I/O coprocessor allows for error correction and recovery from ’C4x
  commport interface problems.

- An I/O coprocessor can buffer data, allowing faster ’C4x data throughput.


Figure 8-4 shows the ’C32-to-’C4x interface. Through the interface, a ’C4x
commport is memory-mapped to the ’C32 external memory bus. The interface
uses four ’C32 I/O pins to drive the commport control signals.


Figure 8-4. A ’C32 to ’C4x Interface


Pullup resistors in the XF0, XF1, TCLK0, and TCLK1 lines are used to prevent
undesired glitches due to temporary high-impedance conditions. Serial
resistors are also used on the same pins for better impedance matching.


The interface software drivers and a more detailed explanation of the interface 
can be obtained from our TI BBS (filename 4xaic.exe). Token transfer and 
word transfer drivers are included with the software. 


8.8 Implementing a Token Forcer 


After system reset, half of the communication channels associated with a par- 
ticular ’C4x have token ownership (communication ports 0, 1, 2), and the other 
half (communication ports 3, 4, 5) do not. 


If, because of system configuration requirements, communication port
direction must be changed, the circuits shown in Figure 8-5 and Figure 8-6
can be used. The circuits force the token to be passed and communication port
direction to remain changed.


Even though these circuits are intended to force a change of the original com- 
munication port direction after reset, they can be used also to maintain the orig- 
inal direction. However, this can be more conveniently achieved using pullups 
in CACK and CREQ. The pullups prevent any damage to the communication 
ports in the event of a program error that writes into a port configured as an 
input. 


Forcing a communication port to become an output port 


Figure 8-5 shows a circuit that forces a communication port to become an
output port. In this circuit, driving the CACK line with the CREQ line
reconfigures an input port as an output port. When a word is written to the
FIFO, CREQ is driven low, indicating a token request. After a synchronizer
delay of 1 to 2 cycles (U1 and U2), CACK is driven low, indicating a token
acknowledge. CREQ then goes active high and is then held high by Rp as the
line switches to an input. The CLK signal can be any clock with a frequency
equal to or lower than the H1/H3 clock.


The synchronizer delay is important. If no delay is provided, the CREQ line will 
not be ready to change to an input high condition. As a result, the CACK line, 
which, at this point, is a delayed version of CREQ, is inverted and applied to 
the CREQ line. This results in an oscillation until the synchronizer period has 
timed out. 
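

The effect of the synchronizer delay can be illustrated with a clock-sampled model of a flip-flop chain. The Python sketch below is a simplification and does not reproduce the circuit of Figure 8-5: the function name, the `stages` and `idle` parameters, and the sample sequence are assumptions made for this example. In a sampled model the delay is exactly `stages` clock periods; in hardware it falls between 1 and 2 cycles because CREQ changes asynchronously to the clock.

```python
def synchronize(creq_samples, stages=2, idle=1):
    """Pass CREQ through `stages` clocked flip-flops and return CACK samples."""
    pipeline = [idle] * stages
    cack = []
    for creq in creq_samples:
        cack.append(pipeline[-1])          # output of the last flip-flop
        pipeline = [creq] + pipeline[:-1]  # clock edge: shift by one stage
    return cack


# CREQ goes low (token request) at sample 1 and returns high at sample 4.
creq = [1, 0, 0, 0, 1, 1, 1]
print(synchronize(creq))
```

With two stages, CACK follows CREQ two samples later, which gives the port time to release CREQ before the acknowledge arrives.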


Figure 8-5. A Token Forcer Circuit (Output)


Forcing a communication port to become an input port


Figure 8-6 shows a circuit that forces a communication port to become an
input port. In this circuit, driving the CREQ line with an inverted CACK
reconfigures an output port as an input port. If CREQ is an input, it is held
low through Rs whenever CACK is high or floating high because of Rp. The port
then responds to this request by driving CACK low, which, in turn, drives
CREQ high, finishing the token acknowledge. As in Figure 8-5, synchronizer
delays mimic the response of another ’C4x communication port to prevent
oscillation.


Figure 8-6. Communication-Port Driver Circuit (Input)


Note that after the port has been reconfigured as an input port, the CREQ line
is active high while the output of the inverter is low. This causes a constant cur- 
rent flow from CREQ to the inverter. 


8.9 Implementing a CSTRB Shortener Circuit 


In ’C40 device revisions lower than 3.0, the width of the CSTRB low pulse
between word boundaries should not exceed 1.0 H1/H3 cycle at the receiving
end. A CSTRB low beyond the synchronization period on a word boundary can be
recognized as a new valid CSTRB, resulting in an extra byte reception (byte
slippage). For a short distance between two communicating ’C4x devices,
byte slippage is not a problem. In ’C40 device revisions 3.0 or higher, or in
any revision of the ’C44, no CSTRB width restriction exists.


The circuit shown in Figure 8-7 can reduce the width of CSTRB for very long
distances when you are using ’C4x device revisions lower than 3.0. The circuit
has buffers for CSTRB and CRDY on the transmitting end and two S-R flip-flops
on the receiving end. On the receiving end, a low STRB incoming signal
causes the Q signal of S-R flip-flop U1 to go low, forcing the CSTRB pin to go
low. When CRDY responds with a low signal, S-R flip-flop U2 drives the RDY
signal low. Because RDY is also tied to the S input of U1, and S has
precedence over R in an S-R flip-flop, Q in U1 goes high. Also, STRB is
inverted and drives the S input of U2. In this way, the width of the local
CSTRB is shortened, regardless of the channel length. When the STRB signal
goes back high, the S-R flip-flop pair is ready to receive another CSTRB.


Figure 8-7. CSTRB Shortener Circuit

[Figure not reproduced. Recoverable labels: ’C4x transmitter and ’C4x
receiver with CSTRB and CRDY lines; Rs = 470 Ω; same circuit on both sides;
send active circuit element; receive active circuit element.]


8.10 Parallel Processing Through Communication Ports 


The ’C4x communication ports are key to parallel processing design flexibility.
Many processors can be linked together in a wide variety of network
configurations. Figure 8-8 illustrates ’C4x parallel processing connectivity
networks that are used to fulfill many signal processing system needs.


Figure 8-8. ’C4x Parallel Connectivity Networks

[Figure not reproduced. Recoverable panel labels: 4-D Hypercube;
Fully-Connected Network; A more general-purpose structure.]


According to the memory interface, ’C4x parallel system architectures can be
classified in three basic groups:

- Shared-Memory Architecture: shares global memory among processors.

- Distributed-Memory Architecture: each processor has its own private local
  memory. Interprocessor communication is via ’C4x communication ports.

- Shared- and Distributed-Memory Architecture: each processor has its
  own local memory but also shares a global memory with other processors.


Figure 8-8 shows examples of these basic groups. 
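

Two of the network shapes named in Figure 8-8 have simple link rules that can be checked on a host. The Python sketch below is illustrative only: the function names are assumptions made for this example, and the default of six links per node assumes a device with communication ports 0 to 5, as referred to in this chapter.

```python
def hypercube_neighbors(node, dimensions=4):
    """Nodes whose binary address differs from `node` in exactly one bit."""
    return [node ^ (1 << bit) for bit in range(dimensions)]


def fully_connected_size(ports=6):
    """Largest fully connected network when each node has `ports` links."""
    return ports + 1


for node in (0, 5, 15):
    print(node, hypercube_neighbors(node))
print(fully_connected_size())
```

A 4-D hypercube of 16 nodes needs four links per node, while a fully connected network uses one link per remote node, which limits its size to one more than the number of ports.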
8.11 Broadcasting Messages From One ’C4x to Many ’C4x Devices


Message broadcasting from one ’C4x to many ’C4x devices requires a simple
interface. However, try to avoid analog signal delays caused by distance
differences between the ’C4x master and the ’C4x slave processors. These
delays could create bus contention in the CSTRB and CRDY lines. Figure 8-9
shows the block diagram of a multiple processor system. In this design, one
’C4x is the dedicated transmitter, and three ’C4x devices are dedicated
receivers. No reset circuitry is needed, because the transmitter is
communication port 0, and the receivers are communication ports 3, 4, and 5.
At reset, ’C4x communication ports 0, 1, and 2 are output ports, and
communication ports 3, 4, and 5 are input ports.


Because the communications configuration is fixed, no token transfer is
needed; this allows the CREQ and CACK pins of all processors to be
individually pulled up to 5 volts through 22-kΩ resistors.


In all cases, each CSTRB should be individually buffered to ensure that line 
reflections do not corrupt each received CSTRB signal. The data pins CD7-0 
of intercommunicating ’C4x devices can be tied together. In general, for fewer 
than three receivers and distances shorter than six inches, data skew relative 
to CSTRB is not a problem, and data buffering is not needed. However, if more 
than three receivers must be driven by a single transmitter or the distance is 
more than six inches, both the CSTRB and CD7-0 lines must be buffered. 


The CRDY signal input is generated by ORing the RDY outputs of all of the
receiver communication ports. The transmitter should not receive a RDY
signal until every receiver has received all data.
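

The ORing of the receiver RDY outputs can be checked with a one-line model. The Python sketch below is illustrative only; the function name is an assumption made for this example, and the signals are treated as active low, consistent with the description of CRDY going low to acknowledge a byte in Section 8.3.

```python
def combined_crdy(receiver_rdy_levels):
    """OR the active-low RDY outputs; the result is low only when all are low."""
    return int(any(receiver_rdy_levels))


print(combined_crdy([0, 1, 0]))  # one receiver has not acknowledged yet
print(combined_crdy([0, 0, 0]))  # all three receivers have acknowledged
```

Because the levels are active low, a logical OR keeps the transmitter's CRDY input high until the slowest receiver has responded.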


In addition, to ensure that the dedicated receiver ’C4x devices do not try to
arbitrate for the communication-port bus, you should halt the output ports of
the receiver ’C4x devices by setting bit four of their communication-port
control registers to one.


Figure 8-9. Message Broadcasting by One ’C4x to Many ’C4x Devices


’C4x Power Dissipation


The power-supply current requirement ($I_{DD}$) of the ’C4x varies with the
specific application and the device program activity. The maximum power
dissipation of a device can be calculated by multiplying $I_{DD}$ by $V_{DD}$
(the power supply voltage requirement). Both parameters are provided in the
’C4x data sheet. Additionally, due to the inherent characteristics of CMOS
technology, the current requirements depend on clock rates, output loadings,
and data patterns.


This chapter presents the information you need to determine power-supply 
current requirements for the ’C4x under various operating conditions. After 
you make this determination, you can then calculate the device power dissipa- 
tion, and, in turn, thermal management requirements. 


9.1 Capacitive and Resistive Loading 


In CMOS devices, the internal gates swing completely from one supply rail to 
the other. The voltage change on the gate capacitance requires a charge 
transfer, and therefore causes power consumption. 


The required charge for a gate’s capacitance is calculated by the following
equation:


$$Q_{gate} = V_{DD} \times C_{gate} \quad \text{(coulombs)}$$

where:

$Q_{gate}$ is the gate’s charge,
$V_{DD}$ is the supply voltage, and
$C_{gate}$ is the gate’s capacitance.

Since current is coulombs per second, the current can then be obtained from:

$$I = \text{coul/s} = V_{DD} \times C_{gate} \times \text{Frequency}$$

where:

$I$ is the current.


For example, the current consumed by an 80-pF capacitor being driven by a
10-MHz CMOS level square wave is calculated as follows:


$$I = 5\ \text{(volts)} \times 80 \times 10^{-12}\ \text{(farads)} \times 10 \times 10^{6}\ \text{(charges/s)} = 4\ \text{mA @ 10 MHz}$$


Furthermore, if the total number of gates in a device is known, the effective
total capacitance can be used to calculate the current for any voltage and
frequency. For a given CMOS device, the total number of gates is probably not
known, but you can solve for a current at a particular frequency and supply
voltage and later use this current to calculate for any supply voltage and
operating frequency.

$$I_{device} = V_{DD} \times C_{total} \times f_{CLK}$$


where:

$I_{device}$ is the current consumed by the device,
$C_{total}$ is the total capacitance, and
$f_{CLK}$ is the clock frequency.


Solving for power ($P = V \times I$), the equation becomes:


$$P_{device} = V_{DD}^{2} \times C_{total} \times f_{CLK}$$


where:

$P_{device}$ is the power consumed by the device.


In this case, $C_{total}$ includes both internal and external capacitances.
$C_{total}$ can be effectively reduced by minimizing power-consuming internal
operation and external bus cycles. Bipolar devices, pullup resistors, and
other devices consume DC power that adds a constant offset unaffected by
$f_{CLK}$. The effect of these DC losses depends on data, not frequency. This
document assumes an all-CMOS approach in which these effects are minimal.


Another source of power consumption is the current consumed by a CMOS 
gate when it is biased in the linear region. Typically, if a gate is allowed to float, 
it can consume current. Pullups and pulldowns of unused pins are therefore 
recommended. 
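

The current and power relations in this section can be checked numerically. The Python sketch below is illustrative only; the function names are assumptions made for this example, and the input values are the 5 V, 80 pF, 10 MHz case worked in the text.

```python
def switching_current(vdd, capacitance, frequency):
    """I = VDD * C * f, in amperes."""
    return vdd * capacitance * frequency


def switching_power(vdd, capacitance, frequency):
    """P = VDD^2 * C * f, in watts."""
    return vdd * switching_current(vdd, capacitance, frequency)


current = switching_current(5.0, 80e-12, 10e6)
power = switching_power(5.0, 80e-12, 10e6)
print(f"{current * 1e3:.1f} mA, {power * 1e3:.1f} mW")
```

Because current scales linearly with both voltage and frequency, a current measured at one operating point can be rescaled to another by the ratio of the voltages and the ratio of the frequencies.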
9.2 Basic Current Consumption


Generally, power supply current requirements are related to system factors,
for example, operating frequency, supply voltage, temperature, and output
load. In addition, because the current requirement for a CMOS device depends
on the charging and discharging of node capacitance, factors such as clocking
rate, output load capacitance, and data values can be important.


9.2.1 Current Components


The power supply current has four basic components:

- Quiescent
- Internal operations
- Internal bus operations
- External bus operations


9.2.2 Current Dependency 


The power supply current consumption depends on many factors. Four are
system related:

- Operation frequency
- Supply voltage
- Operating temperature
- Output load


Several others are related to TMS320C4x operation:

- Duty cycle of operations
- Number of buses used
- Wait states
- Cache usage
- Data value


You can calculate the total power supply current requirement for a ’C4x device
by using the equation below, which comprises the four basic power supply
current components and three system-related dependencies described above.


$$I_{total} = (I_{q} + I_{iops} + I_{ibus} + I_{xbus}) \times F \times V \times T$$

where:

$I_{total}$ is the total supply current,
$I_{q}$ is the quiescent current component,
$I_{iops}$ is the current component due to internal operations,
$I_{ibus}$ is the current component due to internal bus usage, including data
value and cycle time dependency,
$I_{xbus}$ is the current component due to external bus usage, including data
value, wait state, cycle time, and capacitive load dependency,
$F$ is a scale factor for frequency,
$V$ is a scale factor for supply voltage, and
$T$ is a scale factor for operating temperature.
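

The total-current equation can be written as a small helper for experimenting with the components. The Python sketch below is illustrative only: the function and parameter names are assumptions made for this example, the scale factors default to 1.0 as a stand-in for reference conditions, and the 130 mA and 60 mA inputs are the quiescent and internal-operations figures quoted elsewhere in this chapter.

```python
def total_supply_current(i_q, i_iops, i_ibus, i_xbus, f=1.0, v=1.0, t=1.0):
    """Sum the four current components (mA) and apply the three scale factors."""
    return (i_q + i_iops + i_ibus + i_xbus) * f * v * t


# Quiescent plus internal operations, with no significant bus activity.
print(total_supply_current(130, 60, 0, 0))
```

The bus components and the scale factors depend on the algorithm and the operating conditions, so they must be determined separately before the result is meaningful for a real design.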


This report describes in detail the application of this equation and
determination of all the dependencies. The power dissipation measurements in
this report were taken using a ’C40 PG 3.X running at speeds up to 50 MHz and
at a voltage level of 5 V.


The minimum power supply current requirement is 130 mA. The typical current 
consumption for most algorithms is 350 mA, as described in the TMS320C4x 
data sheet, unless excessive data output is being performed. 


The maximum current requirement for a ’C4x running at 50 MHz is 850 mA and
occurs only under worst case conditions: writing alternating data
(AAAA AAAAh to 5555 5555h) out of both external buses simultaneously, every
cycle, with 80 pF loads.
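

The supply currents quoted above translate into power dissipation by multiplying by the supply voltage. The Python sketch below is illustrative only; the function name is an assumption made for this example, and the 5 V supply is the measurement condition stated in this section.

```python
def power_dissipation_watts(idd_ma, vdd_volts=5.0):
    """Power in watts from a supply current in milliamperes."""
    return idd_ma / 1000.0 * vdd_volts


for label, idd_ma in (("minimum", 130), ("typical", 350), ("maximum", 850)):
    print(f"{label}: {power_dissipation_watts(idd_ma):.2f} W")
```

These figures are a starting point for thermal management estimates; the actual dissipation depends on the algorithm and loading, as described in the rest of this chapter.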


9.2.3 Algorithm Partitioning 


Each part of an algorithm has its own pattern with respect to internal and exter- 
nal bus usage. To analyze the power supply current requirement, you must 
partition an algorithm into segments with distinct concentrations of internal or 
external bus usage. Analyze each program segment to determine its power- 
supply current requirement. You can then calculate the average power supply 
current requirement from the requirements of each segment of the algorithm. 
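

One way to combine the per-segment results is a time-weighted average. The Python sketch below is illustrative only: weighting each segment by its cycle count is an assumption about how the average is formed, the function name is hypothetical, and the segment values are made-up numbers rather than measurements.

```python
def average_current(segments):
    """Time-weighted mean of (cycles, current_mA) pairs."""
    total_cycles = sum(cycles for cycles, _ in segments)
    if total_cycles == 0:
        raise ValueError("at least one segment must have a nonzero length")
    return sum(cycles * current for cycles, current in segments) / total_cycles


# Hypothetical algorithm: 300 cycles of internal work, 100 cycles of bus traffic.
print(average_current([(300, 190), (100, 350)]))
```

Segments that run longer dominate the average, so a short burst of heavy external bus activity raises the average current less than its peak value suggests.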


9.2.4 Test Setup Description 


All TMS320C4x supply current measurements were performed on the test 
setup shown in Figure 9—1. The test setup consists of a TMS320C40, capaci- 
tive loads on all data and address lines, but no resistive loads. A Tektronix digi- 
tal multimeter measures the power supply current. Unless otherwise specified, 
all measurements are made at a supply voltage of 5 V, an input clock frequency 
of 50 MHz, a capacitive load of 80 pF, and an operating temperature of 25°C. 
Note that the current consumed by the oscillator and pullup resistors does not 
flow through the current meter. This current is considered part of the system’s 
resistive loss (see section 9.1, Capacitive and Resistive Loading). 
Figure 9-1. Test Setup


9.3 Current Requirement of Internal Components


The power-supply current requirement for internal circuitry consists of three
components: quiescent, internal operations, and internal bus operations.
Quiescent and internal operations are constants, whereas the internal bus
operations component varies with the rate of internal bus usage and the data
values being transferred.


9.3.1 Quiescent


The quiescent requirement for the TMS320C4x is 130 mA while in IDLE.
Quiescent refers to the baseline supply current drawn by the TMS320C4x
during minimal internal activity. Examples of quiescent current include:

- Maintaining timer and oscillator
- Executing the IDLE instruction
- Holding the TMS320C4x in reset


9.3.2 Internal Operations


Internal operations include register-to-register multiplication, ALU operations, 
and branches, but not external bus usage or significant internal bus usage. In- 
ternal operations add a constant 60 mA above the quiescent requirement, so 
that the total contribution of quiescent and internal operation is 190 mA. Note, 
however, that internal and/or external program operations executed via an 
RPTS instruction do not contribute an internal operations power supply current 
component. During an RPTS instruction, program fetch activity other than the 
instruction being repeated is suspended; therefore, power-supply current is 
related only to the data operations performed by the instruction being 
executed. 
Figure 9-2. Internal and Quiescent Current Components


Internal Bus Operations


The internal bus operations include all operations that utilize the internal buses 
extensively, such as internal RAM accesses every cycle. No distinction is 
made between internal reads or writes, such as instruction or operand fetches 
from internal memory, because internally they are equal. Significant use of 
internal buses adds a data-dependent term to the equation for the power sup- 
ply current requirement. Recall that switching requires more current. Hence, 
changing data at high rates requires higher power-supply current. 


Pipeline conflicts, use of cache, fetches from external wait-state memory, and 
writes to external wait-state memory all affect the internal and external bus 
cycles of an algorithm executing on the TMS320C4x. Therefore, you must 
determine the algorithm’s internal bus usage in order to accurately calculate 
power supply current requirements. The TMS320C4x software simulator and 
XDS emulator both provide benchmarking and timing capabilities that help you 
determine bus usage. 


Figure 9-3. Internal Bus Current Versus Transfer Rate


The current resulting from internal bus usage varies linearly with transfer rates.
Figure 9-3 shows internal bus-current requirements for transferring alternat- 
ing data (AAAA AAAAh to 5555 5555h) at several frequencies. Note that trans- 
fer rates greater than the TMS320C4x’s MIPS rating are possible because of 
internal parallelism. 


The data set AAAA AAAAh to 5555 5555h exhibits the maximum internal bus 
current for data transfer operations. The current required for transferring other 
data patterns may be derated accordingly, as described later in this subsec- 
tion. 


As the transfer rate decreases (that is, transfer-cycle time increases), the
incremental IDD approaches 0 mA. Figure 9-3 represents the incremental IDD due
to internal bus operations, which is added to the quiescent and internal
operations current values.


For example, the maximum transfer rate corresponds to three accesses every 
cycle (one program fetch and two data transfers) or an effective one-third H1 
transfer cycle time. At this rate, 178 mA is added to the quiescent (130 mA) 
and internal operation (60 mA) current values for a total of 368 mA. 


Figure 9-3 shows the internal bus current requirement when transferring As
followed by 5s for various transfer rates. Figure 9-4 shows the data
dependence of the internal bus-current requirement when the data is other than As
followed by 5s. The trapezoidal region bounds all possible data values
transferred. The lower line represents the scale factor for transferring the same
data. The upper line represents the scale factor for transferring alternating
data (all 0s to all Fs, all As to all 5s, and so on).


The possible permutation of data values is quite large. The term relative data 
complexity refers to a relative measure of the extent to which data values are 
changing and the extent to which the number of bits are changing state. There- 
fore, relative data complexity ranges from 0, signifying minimal variation of 
data, to a normalized value of 1, signifying greatest data variation. 


Figure 9-4. Internal Bus Current Versus Data Complexity Derating Curve


If statistical knowledge of the data exists, Figure 9-4 can be used to
determine a more accurate power supply requirement on the basis of internal bus
usage. For example, Figure 9-4 indicates an 89.5% scale factor when all Fs
(FFFF FFFFh) are moved internally every cycle with two accesses per cycle
(80 Mbytes per second). Multiplying this scale factor by 178 mA (from
Figure 9-3) yields 159 mA due to internal bus usage. Therefore, an algorithm
running under these conditions requires about 349 mA of power supply current
(130 + 60 + 159).


Since a statistical knowledge of the data may not be readily available, a nomi- 
nal scale factor may be used. The median between the minimum and maxi- 
mum values at 50% relative data complexity yields a value of 0.93 and can be 
used as an estimate of a nominal scale factor. Therefore, this nominal data 
scale factor of 93% can be used for internal bus data dependency, adding 
165.5 mA to 130 mA (quiescent) and 60 mA (internal operations) to yield 355.5 
mA. As an upper bound, assume worst case conditions of three accesses of 
alternating data every cycle, adding 178 mA to 130 mA (quiescent) and 60 mA 
(internal operations) to yield 368 mA. 
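

The derating arithmetic in this subsection can be checked with a short
script. The following Python sketch is illustrative only: the function and
constant names are hypothetical, and the numbers are the ones quoted in the
text (130 mA quiescent, 60 mA internal operations, 178 mA maximum internal
bus current, and scale factors read from Figure 9-4).

```python
QUIESCENT_MA = 130.0         # quiescent current while in IDLE (from the text)
INTERNAL_OPS_MA = 60.0       # constant adder for internal operations (from the text)
MAX_INTERNAL_BUS_MA = 178.0  # Figure 9-3 value quoted for the maximum transfer rate


def internal_current_ma(bus_ma=MAX_INTERNAL_BUS_MA, data_scale=1.0):
    """Quiescent + internal operations + derated internal bus current, in mA.

    data_scale is the data-complexity scale factor read from Figure 9-4
    (1.0 corresponds to alternating AAAA AAAAh / 5555 5555h data).
    """
    if not 0.0 <= data_scale <= 1.0:
        raise ValueError("data_scale must be between 0 and 1")
    return QUIESCENT_MA + INTERNAL_OPS_MA + bus_ma * data_scale


worst_case = internal_current_ma()               # alternating data, upper bound
nominal = internal_current_ma(data_scale=0.93)   # nominal scale factor
all_fs = internal_current_ma(data_scale=0.895)   # all Fs moved every cycle

print(f"worst case: {worst_case:.1f} mA")
print(f"nominal:    {nominal:.1f} mA")
print(f"all Fs:     {all_fs:.1f} mA")
```

The three results correspond to the 368 mA, 355.5 mA, and roughly 349 mA
figures worked out in the text.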


9.4 Current Requirement of Output Driver Components


The output driver circuits on the TMS320C4x are required to drive significantly 
higher DC and capacitive loads than internal device logic drivers. Because of 
this, output drivers impose higher supply current requirements than other sec- 
tions of circuitry in the device. 


Accordingly, the highest values of supply current are exhibited when external 
writes are being performed at high speed. During read cycles, or when the 
external buses are not being used, the TMS320C4x is not driving the data bus; 
this eliminates a significant component of the output buffer current. Further- 
more, in many typical cases, only a few address lines are changing, or the 
whole address bus is static. Under these conditions, an insignificant amount 
of supply current is consumed. Therefore, when no external writes are being 
performed or when writes are performed infrequently, current due to output 
buffer circuitry can be ignored. 


When external writes are being performed, the current required to supply the 
output buffers depends on several considerations: 


- Data pattern being transferred
- Rate at which transfers are being made
- Number of wait states implemented (because wait states affect rates at
  which bus signals switch)
- External bus DC and capacitive loading


External bus operations involve external writes to the device and constitute a 
major power-supply current component. The power supply current for the 
external buses, made up of four components, is summarized in the following 
equation: 


$I_{xbus} = (I_{base,local} + I_{local}) + (I_{base,global} + I_{global})$

where:

$I_{base,local}$ and $I_{base,global}$ are the currents consumed by the internal
driver and pin capacitance,

$I_{local}$ is the local bus current component, and

$I_{global}$ is the global bus current component.


The remainder of this section describes in detail the calculation of external bus 
current requirements. 


Note:

The DMA current component (IDMA) and communication port current component
(ICP) should be included in the calculation of Ixbus if they are used in the
operations.


9.4.1 Local or Global Bus


The current due to bus writes varies with write cycle time. As discussed in the 
previous section, to obtain accurate current values, you must first determine 
the rate and timing for write cycles to external buses by analyzing program 
activity, including any pipeline conflicts that may exist. To do this, you can use 
information from the TMS320C4x emulator or simulator as well as the 
TMS320C4x User’s Guide. In your analysis, you must account for effects from 
the use of cache, because use of cache can affect whether or not instructions 
are fetched from external memory. 


When evaluating external write activity in a given program segment, you must
consider whether or not a particular level of external write activity constitutes 
significant activity. If writes are being performed at a slow enough rate, they 
do not impact supply current requirements significantly and can be ignored. 
This is the case, however, only if writes are being performed at very slow rates 
on either the local or global bus. 


When bus-write cycle timing has been established, Figure 9-5 can be used
to determine the contribution to supply current due to bus activity. Figure 9-5
shows values of current contribution from the local or global bus for various
transfer rates. This data was gathered when alternating values of 5555 5555h
and AAAA AAAAh were written at a capacitive load of 80 pF per output signal
line. This condition exhibits the highest current values on the device. The
values presented in the figure represent the incremental current contributed by
the local or global bus output driver circuitry under the given conditions.
Current values obtained from this graph are later scaled and added to several
other current terms to calculate the total current for the device. As indicated
in the figure, the lower limit, Ibase = Iq + Iiops + Iibus, is essentially Itotal
for transfer rates less than 1 Mword/second.


Figure 9-5. Local/Global Bus Current Versus Transfer Rate and Wait States


Figure 9-5 demonstrates a feature of the ’C4x’s external bus architecture
known as a posted write. In general, data is written to a latch (or a one-deep
FIFO) and held by the bus until the bus cycle is complete. Since the CPU may
not require that bus again for some time, the CPU is free to perform operations
on other buses until a conflict occurs. Conflicts include DMA, a second write,
or a read to the bus.


In Figure 9-5, the upper line is applicable when STI || STI is not dominated by
execution of internal NOPs and the external wait state is equal to zero. The
lower line shows when STI || STI is internally stalled while waiting for the
external bus to go ready because of wait states. The addition of NOPs between
successive STI || STI operations contributes to internal bus current and
therefore does not result in the lowest possible current.


Figure 9-6. Local/Global Bus Current Versus Transfer Rate at Zero Wait States
[Plot of IDD (mA) versus transfer rate (Mword/second).]


To further illustrate the relationship of current and write cycle time, Figure 9-6
shows the characteristics of current for various numbers of cycles between
writes for zero wait states. The information on this graph can be used to obtain
more precise values of current whenever zero wait states are used. Table 9-1
lists the number of cycles used for software-generated wait states.


Table 9-1. Wait State Timing Table

Wait State    Read Cycles    Write Cycles
0             1              2
1             2              3
2             3              4
3             4              5

Once a current value has been obtained from Figure 9-5 or Figure 9-6, this
value can be scaled by a data dependency factor if necessary, as described
in subsection 9.4.4, Data Dependency. This scaled value is then summed along
with several other current terms to determine the total supply current.
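

When estimating the write rate to look up in Figure 9-5 or Figure 9-6, the
cycle counts in Table 9-1 can be turned into a transfer rate. The Python
sketch below is a rough aid, not part of the original procedure. It assumes
that the table entries are H1 cycles per access and that back-to-back writes
are issued with no other activity in between; the function names and the
20-MHz H1 value used in the example are illustrative.

```python
# (read cycles, write cycles) per software wait state, copied from Table 9-1
WAIT_STATE_CYCLES = {
    0: (1, 2),
    1: (2, 3),
    2: (3, 4),
    3: (4, 5),
}


def access_cycles(wait_states, write=False):
    """Return the cycles per external access listed in Table 9-1."""
    if wait_states not in WAIT_STATE_CYCLES:
        raise ValueError("Table 9-1 covers only 0 to 3 wait states")
    read_cycles, write_cycles = WAIT_STATE_CYCLES[wait_states]
    return write_cycles if write else read_cycles


def max_write_rate_mwords(h1_mhz, wait_states):
    """Upper bound on back-to-back write rate in Mword/second.

    Assumption: one table cycle equals one H1 cycle.
    """
    return h1_mhz / access_cycles(wait_states, write=True)


for ws in range(4):
    rate = max_write_rate_mwords(20.0, ws)
    print(f"{ws} wait states: {rate:.2f} Mword/second")
```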


9.4.2 DMA


Using DMA to transfer data consumes power that is data dependent. The
current resulting from DMA bus usage (IDMA) varies linearly with the transfer
rate. Figure 9-7 shows DMA bus current requirements for transferring alternating
data (AAAA AAAAh to 5555 5555h) at several transfer rates; it also shows that
current consumption increases when more DMA channels are used. However,
as more DMA channels are used, the incremental change in current diminishes
as the internal DMA bus becomes saturated. Note that DMA current is
superimposed over the Iibus (internal bus) value.


Figure 9-7. DMA Bus Current Versus Clock Rate


9.4.3 Communication Port


Communication port operations add a data-dependent term to the equation for
the current requirement. The current resulting from communication port
operation (ICP) varies linearly with the transfer rate. Figure 9-8 shows
communication port operation current requirements for transferring alternating
data (AAAA AAAAh to 5555 5555h) at several transfer rates; it also shows that
current consumption increases when more communication port channels are
used. As with DMA bus current consumption, the incremental change diminishes
as more channels are added, because the peripheral bus eventually saturates.


Figure 9-8. Communication Port Current Versus Clock Rate


Note that since the communication ports are intended to communicate with
other TMS320C4x communication ports over short distances, no additional
capacitive loading was added. In this case, the transmission distance is about
6 inches without additional 80-pF loads. Note that communication port current
is superimposed over the Iibus value.


9.4.4 Data Dependency 


Data dependency of the current for the local and global buses is expressed as 
a scale factor that is a percentage of the maximum current exhibited by either 
of the two buses. 


Figure 9-9. Local/Global Bus Current Versus Data Complexity


Figure 9-9 shows normalized weighting factors that can be used to scale
current requirements on the basis of patterns in data being written on the external
buses. The range of possible weighting factors forms a trapezoidal pattern
bounded by extremes of data values. As the figure shows, the minimum
current occurs when all zeros are written, while the maximum current occurs when
alternating 5555 5555h and AAAA AAAAh are written. This condition results
in a weighting factor of 1, which corresponds to using the values from
Figure 9-5 and/or Figure 9-6 directly.


As with internal bus operations, data dependencies for the external buses are
well defined, but accurate prediction of data patterns is often either impossible
or impractical. Therefore, unless you have precise knowledge of data patterns,
you should use an estimate of a median or average value for the scale factor.
Assuming that data will be neither 5s and As nor all 0s and will be varying
randomly, then a value of 0.80 is appropriate. Otherwise, if you prefer a
conservative approach, you can use a value of 1.0 as an upper bound.


Regardless of the approach taken for scaling, once you determine the scale 
factor for the buses, apply this factor to the current values you determined with 
the graphs in section 9.4.1, Local or Global Bus. 


For example, if a nominal scale factor of 0.80 for the buses is assumed, the 
current contribution from the two buses is as follows: 


Local or Global: 0.80 x 133 mA = 106.4 mA


9.4.5 Capacitive Loading Dependence 


Once cycle timing and data dependencies have been accounted for, capaci- 
tive loading effects should be calculated and applied. Figure 9-10 shows the 
current values obtained above as a function of actual load capacitance if the 
load capacitance presented to the buses is less than 80 pF. 


In the previous example, if the load capacitance is 20 pF instead of 80 pF, the 
actual pin current would be 1.66 mA. 


While the slope of the line in Figure 9-10 can be used to interpolate scale fac- 
tors for loads greater than 80 pF, the TMS320C4x is specified to drive output 
loads less than 80 pF; interface timings cannot be guaranteed at higher loads. 
With data dependency and capacitive load scale factors applied to the current 
values for local and global buses, the total supply current required for the 
device for a particular application can be calculated, as described in the next 
section. 


Figure 9-10. Pin Current Versus Output Load Capacitance (10 MHz)


9.5 Calculation of Total Supply Current


The previous sections have discussed currents contributed by different
sources on the TMS320C4x. Because determinations of actual current values
are unique and independent for each source, each current source was
discussed separately. In an actual application, however, the sum of the
independent contributions determines the total current requirement for the device. This
total current value is exhibited as the total current supplied to the device
through all of the VDD inputs and returned through the VSS connections.


Note that numerous VDD and VSS pins on the device are routed to a variety of
internal connections, not all of which are common. Externally, however, all of
these pins should be connected in parallel to 5 V and ground planes, providing
very low impedance.


As mentioned previously, because of the inherent differences in operations
between program segments, it is usually appropriate to consider current for
each of the segments independently. In this way, peak current requirements
are readily obtained. Further, you can make average current calculations to
use in determining heating effects of power dissipation. These effects, in turn,
can be used to determine thermal management considerations.


9.5.1 Combining Supply Current Due to All Components


To determine the total supply current requirements for any given program 
activity, calculate each of the appropriate components and combine them in 
the following sequence: 


1) Start with the 130 mA quiescent current requirement.


2) Add 60 mA for internal operations unless the device is dormant, such as
   when executing IDLE, or is using an RPTS instruction to perform internal
   and/or external bus operations (see the Internal Operations subsection of
   section 9.3). Internal or external bus operations executed via RPTS do not
   contribute an internal operations power supply current component. Therefore,
   current components in the next two steps may still be required, even
   though the 60 mA is omitted.


3) If significant internal bus operations are being performed (see the
   Internal Bus Operations subsection of section 9.3), add the calculated
   current value.


4) If external writes are being performed at high speed (see section 9.4,
   Current Requirement of Output Driver Components), then
   add the values calculated for local and global bus current components.


5) Add DMA and communication port current requirements if they are used.


The current value resulting from summing these components is the total
device current requirement for a given program activity.
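

The five steps above can be sketched as a small function. This Python
example is a hedged illustration, not a TI-supplied tool: the function and
parameter names are hypothetical, and the caller is expected to supply bus,
DMA, and communication port values that have already been read from the
figures and scaled as described in the preceding sections.

```python
def segment_current_ma(internal_ops=True, internal_bus_ma=0.0,
                       local_bus_ma=0.0, global_bus_ma=0.0,
                       dma_ma=0.0, comm_port_ma=0.0):
    """Total supply current (mA) for one program segment at nominal conditions.

    internal_ops should be False when the device is executing IDLE or when
    the bus operations are performed under an RPTS instruction.
    """
    total = 130.0                            # step 1: quiescent
    if internal_ops:
        total += 60.0                        # step 2: internal operations
    total += internal_bus_ma                 # step 3: internal bus operations
    total += local_bus_ma + global_bus_ma    # step 4: external writes
    total += dma_ma + comm_port_ma           # step 5: DMA and comm ports
    return total


idle = segment_current_ma(internal_ops=False)
busy = segment_current_ma(internal_bus_ma=178.0 * 0.93)
print(f"IDLE: {idle:.1f} mA, internal bus at nominal data scale: {busy:.1f} mA")
```

Keeping each step as a separate argument makes it easy to evaluate program
segments independently, which is how peak and average currents are obtained.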


9.5.2 Supply Voltage, Operating Frequency, and Temperature Dependencies 


Three additional factors that affect current requirements are supply voltage
level, operating temperature, and operating frequency. However, these
considerations affect total supply current, not specific components (that is,
internal or external bus operations). Note that supply voltages, operating
temperature, and operating frequency must be maintained within required device
specifications.


The scale factor for these dependencies is applied in the same manner as
discussed in previous sections, once the total current for a particular program
segment has been determined. Figure 9-11 shows the relative scale factors
to be applied to the supply current values as a function of both VDD and
operating frequency.


Figure 9-11. Current Versus Frequency and Supply Voltage


Power-supply current consumption does not vary significantly with operating
temperature. However, you can use a scale factor of 2% normalized IDD per
50°C change in operating temperature to derate current within the specified
range noted in the TMS320C4x data sheet.


Figure 9-12. Change in Operating Temperature (°C)


This temperature dependence is shown graphically in Figure 9-12. Note that
a temperature scale factor of 1.0 corresponds to current values at 25°C, which
is the temperature at which all other references in the document are made.


9.5.3 Design Equation


The procedure for determining the power-supply current requirement can be
summarized in the following equation:


$I_{total} = (I_{qidle} + I_{iops} + I_{ibus} + I_{xbusglobal} + I_{xbuslocal} + I_{DMA} + I_{CP}) \times F \times V \times T$

where:

F is a scale factor for frequency

V is a scale factor for supply voltage

T is a scale factor for operating temperature
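

As an illustration of how the design equation is applied, the Python sketch
below sums the seven current terms and multiplies by the three scale factors.
The class and function names are hypothetical. The example call uses the
typical values listed in Table 9-2 and leaves F, V, and T at 1.0, which
corresponds to the nominal conditions used elsewhere in this chapter; actual
scale factors must be read from Figure 9-11 and Figure 9-12.

```python
from dataclasses import dataclass


@dataclass
class CurrentTerms:
    """Current terms of the design equation, all in mA."""
    qidle: float = 130.0      # quiescent (internal idle)
    iops: float = 60.0        # internal operations
    ibus: float = 0.0         # internal bus operations
    xbus_global: float = 0.0  # global bus writes
    xbus_local: float = 0.0   # local bus writes
    dma: float = 0.0          # DMA transfers
    cp: float = 0.0           # communication ports


def total_current_ma(terms, f=1.0, v=1.0, t=1.0):
    """Apply the design equation: sum of the terms times F, V, and T."""
    subtotal = (terms.qidle + terms.iops + terms.ibus
                + terms.xbus_global + terms.xbus_local
                + terms.dma + terms.cp)
    return subtotal * f * v * t


typical = CurrentTerms(ibus=50.0, xbus_global=50.0, xbus_local=50.0,
                       dma=50.0, cp=50.0)
print(f"typical total: {total_current_ma(typical):.1f} mA")
```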


Table 9-2 describes the symbols used in the power-supply current equation
and gives each value and the figure from which the value is obtained.


Table 9-2. Current Equation Typical Values (FCLK = 40 MHz)

Symbol             Min     Typical  Max     Note                                   Reference
Iqidle2            -       20 µA    50 µA   Idle2 shutdown                         Figure 9-2
Iqidle             130 mA  130 mA   130 mA  Internal idle                          Figure 9-2
Iiops              60 mA   60 mA    60 mA   Branch to self, internal               Figure 9-2
Iibus              0 mA    50 mA    190 mA  Data dependent                         Figure 9-3, Figure 9-4
Ixbusglobal (max)  0 mA    50 mA    280 mA  Data and Cload dependent               Figure 9-5, Figure 9-6, Figure 9-9
Ixbuslocal (max)   0 mA    50 mA    280 mA  Data and Cload dependent               Figure 9-5, Figure 9-6, Figure 9-9
IDMA               0 mA    50 mA    300 mA  Data and source/destination dependent  Figure 9-7
ICP                0 mA    50 mA    250 mA  Data dependent                         Figure 9-8

Notes: 1) All values are scaled by frequency and supply voltage. The nominal tested frequency is 40 MHz.
       2) Externally-driven signals are capacitive-load dependent.
       3) It is unrealistic to add all of the maximum values, since it is impossible to run at those levels.


9.5.4 Average Current


Over the course of an entire program, some segments typically exhibit
significantly different levels of current for different durations. For example, a
program may spend 80% of its time performing internal operations and draw a
current of 250 mA; it may spend the remaining 20% of its time performing writes
at full speed to both buses and drawing 790 mA.


While knowledge of peak current levels is important in order to establish power
supply requirements, some applications require information about average
current. This is particularly significant if periods of high peak current are
short in duration. You can obtain average current by
performing a weighted sum of the current due to the various independent
program segments over time. You can calculate the average current for the
example in the previous paragraph as follows:


$I = 0.8 \times 250\ \text{mA} + 0.2 \times 790\ \text{mA} = 358\ \text{mA}$


Using this approach, you can calculate average current for any number of pro- 
gram segments. 
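

The weighted sum generalizes directly to any number of segments. The Python
sketch below is a minimal illustration with a hypothetical function name; it
takes the fraction of time and the current for each program segment and
reproduces the two-segment example from this subsection.

```python
def average_current_ma(segments):
    """Time-weighted average current in mA.

    segments is a sequence of (time_fraction, current_ma) pairs whose
    time fractions must add up to 1.
    """
    segments = list(segments)
    total_fraction = sum(fraction for fraction, _ in segments)
    if abs(total_fraction - 1.0) > 1e-9:
        raise ValueError("time fractions must sum to 1")
    return sum(fraction * current for fraction, current in segments)


# 80% internal operations at 250 mA, 20% full-speed writes at 790 mA
example = average_current_ma([(0.8, 250.0), (0.2, 790.0)])
print(f"average current: {example:.1f} mA")
```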


9.5.5 Thermal Management Considerations 


Heating characteristics of the TMS320C4x are dependent upon power
dissipation, which, in turn, is dependent upon power supply current. When making
thermal management calculations, you must consider the manner in which
power supply current contributes to power dissipation and to the TMS320C4x
package thermal characteristics’ time constant.


Depending on the sources and destinations of current on the device, some
current contributions to IDD do not constitute a component of power dissipation
at 5 volts. That is to say, the TMS320C4x may be acting only as a switch, in
which case the voltage drop is across a load and not across the ’C4x. If the
total current flowing into VDD is used to calculate power dissipation at 5 volts,
erroneously large values for package power dissipation will be obtained. The
error occurs because the current resulting from driving a logic high level into
a DC load appears only as a portion of the current used to calculate system
power dissipation due to VDD at 5 volts. Power dissipation is defined as:


$P = V \times I$


where P is power, V is voltage, and I is current. If device outputs are driving
any DC load to a logic high level, only a minor contribution is made to power
dissipation because CMOS outputs typically drive to a level within a few tenths
of a volt of the power supply rails. If this is the case, subtract these current
components out of the TMS320C4x supply current value and calculate their
contribution to system power dissipation separately (see Figure 9-13).


Figure 9-13. Load Currents
[Figure: two diagrams of a TMS320C4x output driving a load. One is labeled
IDD = IOUT; the other is labeled Device Output Driven Low, ISS = IOUT.]


Furthermore, external loads draw supply current (IDD) only when outputs are
driven high, because when outputs are in the logic zero state, the device is
sinking current through VSS, which is supplied from an external source.
Therefore, the power dissipation due to this component will not contribute through
IDD but will contribute to power dissipation with a magnitude of:


$P = V_{OL} \times I_{OL}$


where VOL is the low-level output voltage and IOL is the current being sunk by
the output, as shown in Figure 9-13. The power dissipation component due
to outputs being driven low should be calculated and added to the total power
dissipation.


When outputs with DC loads are being switched, the power dissipation compo- 
nents from outputs being driven high and outputs being driven low should be 
averaged and added to the total device power dissipation. Power components 
due to DC loading of the outputs should be calculated separately for each pro- 
gram segment before average power is calculated. 
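

The treatment of DC-loaded outputs described above can be sketched as
follows. This Python example is an assumption-laden illustration rather than
a data-sheet calculation: the function names are hypothetical, the output
voltage levels (4.8 V high, 0.2 V low) and the 4-mA load currents are
placeholder values, and the driven-high term is modeled as the small drop
between VDD and the output level times the sourced current. The core current
passed in is the supply current after DC load currents have been subtracted.

```python
def output_power_w(i_oh_a, i_ol_a, vdd=5.0, v_oh=4.8, v_ol=0.2,
                   high_fraction=0.5):
    """Average device dissipation (W) for one DC-loaded output.

    i_oh_a: current sourced into the load when the output is high (A)
    i_ol_a: current sunk from the load when the output is low (A)
    high_fraction: fraction of time the output is high; 0.5 gives the
    plain average of the two components.
    """
    p_high = (vdd - v_oh) * i_oh_a   # minor drop across the driver when high
    p_low = v_ol * i_ol_a            # P = VOL x IOL when sinking current
    return high_fraction * p_high + (1.0 - high_fraction) * p_low


def device_power_w(core_current_a, loaded_outputs, vdd=5.0):
    """Device dissipation (W): core current at VDD plus DC-loaded outputs.

    loaded_outputs is a sequence of (i_oh_a, i_ol_a) pairs.
    """
    outputs = sum(output_power_w(i_oh, i_ol, vdd=vdd)
                  for i_oh, i_ol in loaded_outputs)
    return vdd * core_current_a + outputs


# Placeholder example: 250 mA core current and one output with 4-mA DC loads
power = device_power_w(0.250, [(0.004, 0.004)])
print(f"estimated device dissipation: {power:.4f} W")
```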


Note that unused inputs that are left unconnected may float to a voltage level 
that will cause the input buffer circuits to remain in the linear region, and there- 
fore contribute a significant component to power supply current. Accordingly, 
if you want absolute minimum power dissipation, you should make any unused 
inputs inactive by either grounding or pulling them high. If several unused 
inputs must be pulled high, they can be pulled high together through one resis- 
tor to minimize component count and board space. 


When you use power dissipation values to determine thermal management
considerations, use the average power unless the time duration of individual 
program segments is long. The thermal characteristics of the TMS320C40 in 
the 325-pin PGA package are exponential in nature with a time constant on 
the order of minutes. Therefore, when subjected to a change in power, the tem- 
perature of the device package will require several minutes or more to reach 
thermal equilibrium. 


If the duration of program segments exhibiting high power dissipation values 
is short (on the order of a few seconds) in comparison to the package thermal 
characteristics’ time constant, use average power calculated in the same man- 
ner as average current described in the previous section. Otherwise, calculate 
maximum device temperature on the basis of the actual time required for the 
program segments involved. For example, if a particular program segment 
lasts for 7 minutes, the device essentially reaches thermal equilibrium due to 
the total power dissipation during the period of device activity. 


Note that the average power should be determined by calculating the power
for each program segment (including all considerations described above) and
performing a time average of these values, rather than simply multiplying the
average current, as determined in the previous subsection, by VDD.


Calculate specific device temperature by using the TMS320C4x thermal 
impedance characteristics included in the TMS320C4x data sheet. 


9.6 Example Supply Current Calculations


An FFT represents a typical DSP algorithm. The FFT code used in this
calculation processes data in the RAM blocks. The entire algorithm consists mainly
of internal bus operations and hence includes quiescent and, in general,
internal operations. At the end of the processing, the results are written out on the
global and local bus. Therefore, the algorithm exhibits a higher current
requirement during the write portion where the external bus is being used significantly.


9.6.1 Processing


The processing portion of the algorithm is 95% of the total algorithm. During
this portion, the power-supply current is required for the internal circuitry only.
Data is processed in several loops that make up the majority of the algorithm.
During these loops, two operands are transferred on every cycle. The current
required for internal bus usage, then, is 60 mA (from Figure 9-3). The data is
assumed to be random. A data value scale factor of 0.93 is used (from
Figure 9-4). This value scales 60 mA, yielding 55.8 mA for internal bus
operations. Adding 55.8 mA to the quiescent current requirement and internal
operations current requirement yields a current requirement of 245.8 mA for the
major portion of the algorithm.

$I = I_q + I_{iops} + I_{ibus}$

$I = 130\ \text{mA} + 60\ \text{mA} + (60\ \text{mA})(0.93) = 245.8\ \text{mA}$


9.6.2 Data Output


The portion of the algorithm corresponding to writing out data is approximately
5% of the total algorithm. Again, the data that is being written is assumed to
be random. From Figure 9-4 and Figure 9-9, scale factors of 0.93 and 0.8
are used for derating due to data value dependency for internal and local
buses, respectively. During the data dump portion of the code, a load and a
store are performed every cycle; however, the parallel load/store instruction
is in an RPTS loop. Therefore, there is no contribution due to internal
operations, because the instruction is fetched only once. The only internal
contributions are due to quiescent and internal bus operations. Figure 9-5 indicates
a 23-mA current contribution due to writes every available cycle. Therefore,
the total contribution due to this portion of the code is:

$I = I_q + I_{ibus} + I_{xbus}$

or

$I = 130\ \text{mA} + (60\ \text{mA})(0.93) + 85\ \text{mA} + (23\ \text{mA})(0.8) = 289.2\ \text{mA}$


9.6.3 Average Current


The average current is derived from the two portions of the algorithm. The
processing portion took 95% of the time and required about 245.8 mA; the data
dump portion took the other 5% and required about 289.2 mA. The average
is calculated as:


$I_{avg} = (0.95)(245.8\ \text{mA}) + (0.05)(289.2\ \text{mA}) = 247.97\ \text{mA}$
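

The arithmetic of the FFT example can be reproduced with a few lines of
Python. This is only a check of the numbers quoted in sections 9.6.1 through
9.6.3; the variable names are illustrative, and the 85-mA term is taken as
given from the data output equation, since the text does not break it down
further.

```python
I_Q = 130.0       # quiescent current, mA
I_IOPS = 60.0     # internal operations, mA
I_IBUS = 60.0     # internal bus current from Figure 9-3, mA
IBUS_SCALE = 0.93  # data scale factor for the internal bus
XBUS_SCALE = 0.8   # data scale factor for the external bus

# Processing portion: quiescent + internal operations + scaled internal bus
processing_ma = I_Q + I_IOPS + I_IBUS * IBUS_SCALE

# Data output portion under RPTS: no internal operations term
data_output_ma = I_Q + I_IBUS * IBUS_SCALE + 85.0 + 23.0 * XBUS_SCALE

# Time-weighted average: 95% processing, 5% data output
average_ma = 0.95 * processing_ma + 0.05 * data_output_ma

print(f"processing:  {processing_ma:.1f} mA")
print(f"data output: {data_output_ma:.1f} mA")
print(f"average:     {average_ma:.2f} mA")
```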


From the thermal characteristics specified in the TMS320C4x User’s Guide, 
it can be shown that this current level corresponds to a case temperature of 
28°C. This temperature meets the maximum device specification of 85°C and 
hence requires no forced air cooling. 


9.6.4 Experimental Results 


A photograph of the power-supply current for the FFT, using a 40-MHz system
clock, is shown in Appendix A. During the FFT processing, the current varied 
between 190 and 220 mA. The current during external writes had a peak of 230 
mA, and the average current requirement as measured on a digital multimeter 
was 205 mA. Scaling those results to the 50-MHz calculations yielded results 
that were close to the actual measured power-supply current. 


9.7 Design Considerations 


Designing systems for minimum power dissipation involves reducing device 
operating current requirements due to signal switching rate, capacitive load- 
ing, and other effects. Selective consideration of these effects makes it pos- 
sible to optimize system performance while minimizing power consumption. 
This section describes current reduction techniques based on operating cur- 
rent dependencies of the device as discussed in previous sections of this doc- 
ument. 


9.7.1 System Clock and Signal Switching Rates 


Since current (and therefore, power) requirements of CMOS devices are 
directly proportional to switching frequency, one potential approach to mini- 
mizing operating power is to minimize system clock frequency and signal 
switching rates. Although performance is often directly proportional to system 
clock and signal switching rates, tradeoffs can be made in both areas to 
achieve an optimal balance between power usage and performance in the 
design of a system. 


If reducing power is a primary goal, and a given system design does not have 
particularly demanding performance requirements, the system clock rate can 
be reduced with the corresponding savings in power. Minimum power is real- 
ized when system clock rates are only as fast as necessary to achieve required 
system performance. Additionally, if overall system clock rates cannot be 
reduced, an alternative approach to power reduction is to reduce clock speed 
wherever possible during periods of inactivity. 


Also, an appropriate choice of clock generation approach helps minimize 
system power dissipation. Using an external oscillator rather than the 
on-chip oscillator can result in lower device and system power dissipation. 
As described previously, the internal oscillator can require as much as 
10 mA when operating at 40 MHz. If you use an external oscillator that 
requires less than 10 mA for clock generation, overall system power is 
reduced. 


When considering switching rates of signals other than the system clock, the 
main consideration is to minimize switching. Specifically, any unnecessary 
switching should be avoided. Outputs or inputs that are unused should either 
be disabled, tied high, or grounded, whichever is appropriate. Additionally, out- 
puts connected to external circuitry should drive other power dissipation ele- 
ments only when absolutely necessary. 


9.7.2 Capacitive Loading of Signals 


Current requirements are also directly proportional to capacitive loading. 
Therefore, all capacitive loading should be minimized. This is especially signif- 
icant for device outputs. 
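
The proportionalities described in Sections 9.7.1 and 9.7.2 follow from the usual first-order model of a switching CMOS output, I = C x V x f. The sketch below uses that model; the 80-pF load, 5-V swing, and switching rates are hypothetical values chosen only for illustration, not ’C40 specifications.

```python
def switching_current_ma(load_pf, swing_v, toggle_mhz):
    """First-order average current (mA) to charge a capacitive load once per cycle."""
    load_f = load_pf * 1e-12
    freq_hz = toggle_mhz * 1e6
    return load_f * swing_v * freq_hz * 1e3

# Hypothetical output: 80 pF load, 5 V swing, switching at 10 MHz.
base = switching_current_ma(80.0, 5.0, 10.0)
half_load = switching_current_ma(40.0, 5.0, 10.0)   # shorter signal run
half_rate = switching_current_ma(80.0, 5.0, 5.0)    # slower switching
print(f"base {base:.2f} mA, half load {half_load:.2f} mA, half rate {half_rate:.2f} mA")
```

In this model, halving either the load capacitance or the switching rate halves the current attributed to that output.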


The approaches to minimize capacitive loading are consistent with efficient PC 
board layout and construction practices. Specifically, signal runs should be as 
short as possible, especially for signals with high switching rates. Also, signals 
should not run long distances across PC boards to edge connectors unless 
absolutely necessary. 


Note that the buffering of device outputs that must drive high capacitive loads 
reduces supply current for the TMS320C40, but this current is translated to the 
buffering device. Whether or not this is a valid tradeoff must be determined at 
the system level. The two main considerations are: 1) whether the power 
required by the buffers is more or less than the power required from the ’C40 
to drive the load in question, and 2) whether or not off-loading the power to the 
buffers has any implications with respect to system power-down modes. It may 
be desirable to use buffers to drive high capacitive loads, even though they 
may require more current than the TMS320C40, especially in cases where 
part of the system may be powered down but the TMS320C40 is still required 
to interface to other low capacitance loads. 


9.7.3 DC Component of Signal Loading 


In order to achieve lowest device current requirements, the internal and 
external DC load component of device input and output signal loading must 
also be minimized. 


Any device inputs that are unused and left floating may cause excessively high 
DC current to be drawn by their input buffer circuitry. If an input is left 
unconnected, the voltage on the input may float to a level that biases the 
input buffer at a point within its range of linear operation. This can cause 
the input buffer circuit to draw a significant DC current directly from VDD to 
ground. Therefore, any unused device inputs should be pulled up to VDD via a 
pullup resistor of nominally 20 kΩ, or driven high with an unused gate. 
Input-only pins that are not used can be pulled up in parallel with other 
inputs of the same type with a single gate or resistor to minimize system 
component count. In this case, up to 15 or more standard device inputs can 
be pulled up with a single resistor. 


Any device I/O pins that are unused should be selected as outputs. This avoids 
the requirement for pull-ups (to ensure that the I/O input stage is not biased 
in the linear region) and therefore eliminates an unnecessary current compo- 
nent. 


For any device output, any DC load present is directly reflected in the system’s 
power-supply current. Therefore, DC loading of outputs should be reduced to 
a minimum. If DC currents are being sourced from the address bus outputs, 
the address bus should be set to a level that minimizes the current through the 
external load. This can be accomplished by performing a dummy read from an 
external address. 


For I/O pins that must be used in both the input and output modes, individual 
pullup resistors of nominally 20 kΩ should be used to ensure minimum power 
dissipation if these pins are not always driven to a valid logic state. This 
is particularly true of the data-bus pins. When the bus is not being driven 
explicitly, it is left floating, which can cause excessively high currents to 
be drawn by the input buffer section of all 64 bits of the bus. In this case, 
because all 64 data-bus bits are normally used independently in most 
applications, each data-bus pin should be pulled up with a separate resistor 
for minimum power. 


Development Support and Part Order Information 


This chapter provides development support information, socket descriptions, 
device part numbers, and support tool ordering information for the ’C4x. 


Each ’C4x support product is described in the TMS320 Family Development 
Support Reference Guide (literature number SPRU011). In addition, more 
than 100 third-party developers offer products that support the TI TMS320 
family. For more information, refer to the TMS320 Third-Party Reference 
Guide (literature number SPRU052). 


For information on pricing and availability, contact the nearest TI Field Sales 
Office or authorized distributor. See the list at the back of this book. 


10.1 Development Support 


Texas Instruments offers an extensive line of development tools for the 
TMS320C4x generation of DSPs, including tools to evaluate the performance 
of the processors, generate code, develop algorithm implementations, and ful- 
ly integrate and debug software and hardware modules. 


The following products support the development of ’C4x applications: 


Code Generation Tools 


❏ The optimizing ANSI C compiler translates ANSI C language directly into 
   highly optimized assembly code. You can then assemble and link this code 
   with the TI assembler/linker, which is shipped with the compiler. It 
   supports both ’C3x and ’C4x assembly code. This product is currently 
   available for PCs (DOS, DOS extended memory, OS/2), VAX/VMS, and SPARC 
   workstations. See the TMS320 Floating-Point DSP Optimizing C Compiler 
   User’s Guide (SPRU034) for detailed information about this tool. 


❏ The assembler/linker converts source mnemonics to executable object 
   code. It supports both ’C3x and ’C4x assembly code. This product is 
   currently available for PCs (DOS, DOS extended memory, OS/2). The 
   ’C3x/’C4x assembler for the VAX/VMS and SPARC workstations is only 
   available as part of the optimizing ’C3x/’C4x compiler. See the TMS320 
   Floating-Point DSP Assembly Language Tools User’s Guide (SPRU035) 
   for detailed information about available assembly-language tools. 


❏ The digital filter design package helps you design digital filters. 


System Integration and Debug Tools 


❏ The simulator simulates (via software) the operation of the ’C4x and can 
   be used in C and assembly software development. This product is 
   currently available for PCs (DOS, Windows) and SPARC workstations. See 
   the TMS320C4x C Source Debugger User’s Guide (SPRU054) for detailed 
   information about the debugger. 


❏ The XDS510 emulator performs full-speed in-circuit emulation with the 
   ’C4x, providing access to all registers as well as to internal and external 
   memory of the device. It can be used in C and assembly software 
   development and has the capability to debug multiple processors. This 
   product is currently available for PCs (DOS, Windows, OS/2) and SPARC 
   workstations. This product includes the emulator board (emulator box, 
   power supply, and SCSI connector cables in the SPARC version), the ’C4x 
   C Source Debugger, and the JTAG cable. 


   Because ’C3x and ’C5x XDS510 emulators also come with the same emulator 
   board (or box) as the ’C4x, you can buy the ’C4x C Source Debugger 
   Software as a separate product called ’C4x C Source Debugger Conversion 
   Software. This enables you to debug ’C3x/’C4x applications with the 
   same emulator board. The emulator cable that comes with the ’C3x 
   XDS510 emulator cannot be used with the ’C4x; a JTAG emulation 
   conversion cable (see Section 10.3) is needed instead. The emulator cable 
   that comes with the ’C5x XDS510 emulator can be used with the ’C4x 
   without any restriction. See the TMS320C4x C Source Debugger User’s 
   Guide (SPRU054) for detailed information about the ’C4x emulator. 


❏ The parallel processing development system (PPDS) is a stand-alone 
   board with four ’C4xs directly connected to each other via their 
   communication ports. Each ’C4x has 64K-words SRAM and 8K-byte EPROM as 
   local memory, and they all share a 128K-word global SRAM. See the 
   TMS320C4x Parallel Processing Development System Technical Reference 
   (SPRU075) for detailed information about the PPDS. 


❏ The emulation porting kit (EPK) enables you to integrate emulation 
   technology directly into your system without the need of an XDS510 
   board. This product is intended to be used by third parties and 
   high-volume board manufacturers and requires a licensing agreement with 
   Texas Instruments. 


10.1.1 Third-Party Support 


The TMS320 family is supported by products and services from more than 100 
independent third-party vendors and consultants. These support products 
take various forms (both as software and hardware), from cross-assemblers, 
simulators, and DSP utility packages to logic analyzers and emulators. The ex- 
pertise of those involved in support services ranges from speech encoding and 
vector quantization to software/hardware design and system analysis. 


See the TMS320 Third-Party Support Reference Guide (literature number 
SPRU052) for a more detailed description of services and products offered by 
third parties. 


10.1.2 The DSP Hotline 


For answers to TMS320 technical questions on device problems, develop- 
ment tools, documentation, upgrades, and new products, you can contact the 
DSP hotline via: 


❏ Phone: (713) 274-2320, Monday through Friday from 8:30 a.m. to 5:00 
   p.m. central time 


❏ Fax: (713) 274-2324 (US DSP hotline), +33-1-3070-1032 (European 
   DSP hotline) 


❏ Electronic mail: 4389750@mcimail.com 


To ask about third-party applications and algorithm development packages, 
contact the third party directly. Refer to the TMS320 Third-Party Support 
Reference Guide (SPRU052) for addresses and phone numbers. 


Extensive DSP documentation is available; this includes data sheets, user’s 
guides, and application reports. Contact the hotline for information on litera- 
ture that you can request from the Literature Response Center, 
(800)477-8924. 


The DSP hotline does not provide pricing information. Contact the nearest 
TI Field Sales Office for prices and availability of TMS320 devices and support 
tools. 


10.1.3 The Bulletin Board Service (BBS) 


The TMS320 DSP Bulletin Board Service (BBS) is a telephone-line computer 
service that provides information on TMS320 devices, specification updates 
for current or new devices and development tools, silicon and development 
tool revisions and enhancements, new DSP application software as it be- 
comes available, and source code for programs from any TMS320 user’s 
guide. 


You can access the BBS via: 


❏ Modem: (300-, 1200-, or 2400-bps) dial (713) 274-2323. Set your modem 
   to 8 data bits, 1 stop bit, no parity. 


To find out more about the BBS, refer to the TMS320 Family Development 
Support Reference Guide (literature number SPRU011). 


10.1.4 Internet Services 


Texas Instruments offers two Internet-accessible services for DSP support: an 
FTP site and a World Wide Web site. 


❏ World Wide Web: Point your browser at http://www.ti.com to access TI’s 
   web site. At the site, you can follow links to find product information, 
   online literature, an online lab, and the 320 Hotline online. 


❏ FTP: Use anonymous ftp to ti.com (Internet port address 192.94.94.1) to 
   access copies of the files found on the BBS. The BBS files are located in 
   the subdirectory called mirrors. 


10.1.5 Technical Training Organization (TTO) TMS320 Workshops 


’C4x DSP Design Workshop. This workshop is tailored for hardware and 
software design engineers and decision-makers who will be designing and 
utilizing the ’C4x generation of DSP devices. Hands-on exercises throughout 
the course give participants a rapid start in developing ’C4x design skills. 
Microprocessor/assembly language experience is required. Experience with 
digital design techniques and C language programming is desirable. 


These topics are covered in the ’C4x workshop: 


❏ ’C4x architecture/instruction set 
❏ Use of the PC-based software simulator 
❏ Use of the ’C3x/’C4x assembler/linker 
❏ C programming environment 
❏ System architecture considerations 
❏ Memory and I/O interfacing 
❏ Development support 


For registration information, pricing, or to enroll, call (800) 336-5236, ext. 
3904. 


10.2 Sockets 


Table 10-1 contains available sockets that accept the 325-pin ’C40 pin grid 
array (PGA) and the 304-pin ’C44 Plastic Quad Flatpack (PQF). Table 10-2 
lists the phone numbers of the manufacturers listed in Table 10-1. 


Table 10-1. Sockets that Accept the 325-pin ’C40 and the 304-pin ’C44 


Manufacturer                Type                               Part Number 

Advanced Interconnections   C40-wire-wrap socket               3919 
AMP                         C40-tool-activated ZIF socket      AMP 382533-9 
AMP                         Actuation tool for AMP 382533-9    AMP 854234-1 
AMP                         C40-handle-activated ZIF socket    AMP 382320-9 
AMP                         C40-PGA ZIF                        AMP 55291-2 
Emulation Technology        C40-logic analyzer socket          BZ6-325-H6A35-TMS320C40Z 
Emulation Technology        C40-wire-wrap socket               AB-325-H6A35Z-P13-M 
Mark Eyelet                 C40-wire-wrap socket               MP325-73311D16 
Yamaichi                    TMS320C44 PDB Socket (304 pins)    ic201-3044-004 


Table 10-2. Manufacturer Phone Numbers 


Manufacturer Phone Number 
AMP (717) 564-0100 
Advanced Interconnections (401) 823-5200 
Emulation Technology (408) 982-0660 
Mark Eyelet (203) 756-8847 
Yamaichi (408) 456-0797 


The remainder of this section describes two available sockets that accept the 
’C4x pin grid array (PGA). Both sockets feature zero insertion force (ZIF): 


❏ A tool-activated ZIF socket (TAZ) 
❏ A handle-activated ZIF socket (HAZ) 


The sockets described herein are manufactured by AMP Incorporated. 


10.2.1 Tool-Activated ZIF PGA Socket (TAZ) 


Figure 10-1. Tool-Activated ZIF Socket 

[Figure: socket outline drawing; dimensions shown are 2.061 in. max. and 0.350 in. max.] 


Description: 

AMP part number:     382533-9 
Pin positions:       325 
Solder tail length:  0.170 in. for PC boards 0.125 in. 
                     thick (other tail lengths available) 
Actuator tool:       354234-1 


Features: 

❏ Slightly larger than a PGA device 
❏ Easy package loading because of large funnel entry 
❏ Zero insertion force 
❏ Contact wiping action during insertion ensures clean contact points 
❏ Spring-loaded cover ensures proper loading 
❏ Can be used with robotic insertion and removal 
❏ Horizontal vs. vertical socket forces prevent damage to the device 


10.2.2 Handle-Activated ZIF PGA Socket (HAZ) 


Figure 10-2. Handle-Activated ZIF Socket 

[Figure: socket outline drawing; dimensions shown are 2.700 in. max. and 0.350 in. max.] 


Description: 

AMP part number:     382320-9 
Pin positions:       325 
Solder tail length:  0.170 in. for PC boards 0.125 in. 
                     thick (other tail lengths available) 


Features: 

❏ Can be used for test and burn-in 
❏ Spring contacts are normally closed 
❏ Easy package loading because of large funnel entry 
❏ Zero insertion force 
❏ Contact wiping action during socket closing ensures clean contact points 
❏ Maximum operating temperature is 160°C (to allow burn-in capability) 


10.3 Part Order Information 


This section describes the part numbers of ’C4x devices, development support 
hardware, and software tools. 


10.3.1 Nomenclature 


To designate the stages in the product development cycle, Texas Instruments 
assigns prefixes to the part numbers of all TMS320 devices and support tools. 
Each TMS320 device has one of three prefixes: TMX, TMP, or TMS. Each sup- 
port tool has one of two possible prefix designators: TMDX or TMDS. These 
prefixes represent evolutionary stages of product development from engineer- 
ing prototypes (TMX/TMDX) through fully qualified production devices and 
tools (TMS/TMDS). This development flow is defined below. 


Device Development Evolutionary Flow: 


TMX   The part is an experimental device that is not necessarily 
      representative of the final device’s electrical specifications. 

TMP   The part is a device from a final silicon die that conforms to the 
      device’s electrical specifications but has not completed quality and 
      reliability verification. 

TMS   The part is a fully qualified production device. 


Support Tool Development Evolutionary Flow: 


TMDX  The development-support product has not yet completed Texas 
      Instruments internal qualification testing. 

TMDS  The development-support product is a fully qualified development 
      support product. 


TMX and TMP devices and TMDX development support tools are shipped with 
the following disclaimer: 


“Developmental product is intended for internal evaluation purposes.” 


TMS devices and TMDS development support tools have been fully character- 
ized, and the quality and reliability of the device has been fully demonstrated. 
Texas Instruments standard warranty applies to these products. 


Note: 

It is expected that prototype devices (TMX or TMP) have a greater failure rate 
than standard production devices. Texas Instruments recommends that 
these devices not be used in any production system, because their expected 
end-use failure rate is still undefined. Only qualified production devices 
should be used. 


TI device nomenclature also includes the device family name and a suffix. This 
suffix indicates the package type (for example, N, FN, or GB) and temperature 
range (for example, L). Figure 10-3 provides a legend for reading the com- 
plete device name for any TMS320 family member. 


Figure 10-3. Device Nomenclature 


Example: TMS 320 C 40 GF L 
         (prefix, device family, technology, device, package type, 
         temperature range) 

PREFIX 
  TMX = experimental device 
  TMP = prototype device 
  TMS = qualified device 
  SMJ = ceramic QML 
  SMQ = plastic QML 

DEVICE FAMILY 
  320 = TMS320 Family 

PACKAGE TYPE 
  FD  = ceramic leadless CC 
  FN  = plastic leaded CC 
  FZ  = ceramic CER-QUAD 
  GB  = 181-pin ceramic PGA 
  GE  = 181-pin ceramic PGA 
  GF  = 325-pin ceramic PGA 
  HFH = 352-leaded CER-QFP 
  J   = ceramic DIP 
  JD  = ceramic DIP, side-brazed 
  N   = plastic DIP 
  TA  = tape automated bonding (encapsulated) 
  TB  = tape automated bonding (bare die) 
  KGD = known good die 
  PDB = 304-pin plastic quad flatpack 

TEMPERATURE RANGE (AMBIENT) 
  M = -55 to 125°C 
  S = -55 to 100°C 
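
The field order in Figure 10-3 can be illustrated with a small Python sketch that splits a part number into its fields. This is only an assumption-laden illustration, not a TI tool: the regular expression, the function name, the treatment of the trailing two digits as a speed grade (inferred from the part numbers in Table 10-3), and the acceptance of any letter string as the technology field are all hypothetical choices.

```python
import re

# Package codes from the Figure 10-3 legend, longest first so that
# "JD" is tried before "J".
PACKAGES = ["HFH", "KGD", "PDB", "FD", "FN", "FZ", "GB", "GE", "GF",
            "JD", "TA", "TB", "J", "N"]

PART_RE = re.compile(
    r"^(?P<prefix>TMX|TMP|TMS|SMJ|SMQ)"
    r"(?P<family>320)"
    r"(?P<technology>[A-Z]+?)"        # assumed: one or more letters
    r"(?P<device>\d{2})"
    r"(?P<package>" + "|".join(PACKAGES) + r")"
    r"(?P<temperature>[A-Z])?"        # optional temperature-range letter
    r"(?P<speed>\d{2})?$"             # assumed: optional speed grade
)

def parse_part_number(part):
    """Split a part number into named fields; return None if it does not match."""
    match = PART_RE.match(part)
    return match.groupdict() if match else None

print(parse_part_number("TMS320C40GFL60"))
print(parse_part_number("TMS320C44PDB50"))
```

Fields that are absent from a part number (for example, the temperature letter in TMS320C44PDB50) come back as None.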


10.3.2 Device and Development Support Tools 


Table 10-3 lists ’C4x device part numbers. Table 10-4 lists the development 
support tools available for the 'C4x DSP, their part numbers, and the platform 
on which they run. 


Table 10-3. Device Part Numbers 


Device Part Number   Voltage   Operating Frequency   Package 

TMS320C40GFL         5 V       50 MHz/40 ns          325-pin ceramic PGA 
TMS320C40GFL60       5 V       60 MHz/33 ns          325-pin ceramic PGA 
TMS320LC40GFL60      3.3 V     60 MHz/33 ns          325-pin ceramic PGA 
TMS320LC40GFL80      3.3 V     80 MHz/25 ns          325-pin ceramic PGA 
TMS320C44PDB50                 50 MHz/40 ns          304-pin PQFP 
TMS320C44PDB60                 60 MHz/33 ns          304-pin PQFP 
SMJ320C40GFM40                 40 MHz/50 ns          325-pin ceramic PGA 
SMJ320C40GFM50                 50 MHz/40 ns          325-pin ceramic PGA 
SMJ320C40HFHM40                40 MHz/50 ns          352-lead ceramic PGA 
SMJ320C40HFHM50                50 MHz/40 ns          352-lead ceramic PGA 
SMJ320C40TAM40                 40 MHz/50 ns          324 pad TAB tape (encapsulated) 
SMJ320C40TBM40                 40 MHz/50 ns          324 pad TAB tape (bare die) 
TMS320C40TAL50                 50 MHz/40 ns          324 pad TAB tape (encapsulated) 
SMJ320C40TAM50                 50 MHz/40 ns          324 pad TAB tape (encapsulated) 
SMJ320C40TBM50                 50 MHz/40 ns          324 pad TAB tape (bare die) 
TMS320C40TAL60                 60 MHz/33 ns          324 pad TAB tape (encapsulated) 
SMJ320C40KGDM40                40 MHz/50 ns          Known Good Die 
SMJ320C40KGDM50                50 MHz/40 ns          Known Good Die 
TMS320C40KGDL50                50 MHz/40 ns          Known Good Die 
TMS320C40KGDL60                60 MHz/33 ns          Known Good Die 


Table 10-4. Development Support Tools Part Numbers 


Development Tool                                             Part Number       Platform 

C Compiler/Assembler/Linker                                  TMDS3243855-02    PC (DOS, OS/2) 
C Compiler/Assembler/Linker                                  TMDS3243255-08    VAX (VMS) 
C Compiler/Assembler/Linker                                  TMDS3243555-08    SPARC (Sun OS) 
Assembler/Linker                                             TMDS3243850-02    PC (DOS) 
Simulator (C language)                                       TMDS3244851-02    PC (DOS, Windows) 
Simulator (C language)                                       TMDS3244551-09    SPARC (Sun OS) 
Tartan Floating Point Library                                320FLO-PC-C40     PC (DOS) 
Tartan Floating Point Library                                320FLO-SUN-C40    SPARC (Sun OS) 
Digital Filter Design Package                                DFDP              PC (DOS) 
C Source Debugger Conversion Software                        TMDS3240140       PC (XDS510) 
C Source Debugger Conversion Software                        TMDS3240640       Sun (XDS510WS) 
Emulation Porting Kit                                        TMDX3240040†      
’C3x/’C4x Tartan C/C++ Compiler/Assembler/Linker             TAR-CCM-PC        PC (DOS) 
’C3x/’C4x Tartan C/C++ Compiler/Assembler/Linker             TAR-CCM-SP        SPARC 
’C3x/’C4x Tartan C/C++ Compiler/Assembler/Linker/Simulator   TAR-SIM-PC        PC (DOS) 
’C3x/’C4x Tartan C/C++ Compiler/Assembler/Linker/Simulator   TAR-SIM-SP        SPARC 
’C3x/’C4x Tartan C/C++ XDS510 Debugger                       TAR-DEG-XDS-PC    PC (DOS, Windows) 
’C3x/’C4x Tartan C/C++ XDS510 Debugger                       TAR-DEG-XDS-SP    SPARC (Sun OS) 
XDS510 Emulator§                                             TMDS3260140       PC (DOS, OS/2, Windows) 
XDS510WS Emulator‡                                           TMDS3260640       Sun (SPARC SCSI) 
PC/Sparc JTAG Emulation Cable                                TMDS3080001       XDS510/XDS510WS 
Parallel Processing Development System                       TMDX3261040       XDS510/XDS510WS 


† Requires licensing agreement. 
‡ Includes XDS510WS box, SCSI cable, power supply, and JTAG cable. TMDS3240640 C-source debugger software not included. 
§ Includes XDS510 board and JTAG cable. TMDS3240140 C-source debugger software not included. 


Chapter 11 


XDS510 Emulator Design Considerations 


This chapter explains the design requirements of the XDS510 emulator with 
respect to JTAG designs, and discusses the XDS510 cable (manufacturing 
part number 2617698-0001). This cable is identified by a label on the cable pod 
marked JTAG 3/5V and supports both standard 3-volt and 5-volt target system 
power inputs. 


The term JTAG, as used in this book, refers to TI scan-based emulation, which 
is based on the IEEE 1149.1 standard. 


Topic 

11.1 Designing Your Target System’s Emulator Connector (14-Pin Header) 
11.6 Emulation Timing Calculations 
11.8 Mechanical Dimensions for the 14-Pin Emulator Connector 


11.1 Designing Your Target System’s Emulator Connector (14-Pin Header) 


JTAG target devices support emulation through a dedicated emulation port. 
This port is a superset of the IEEE 1149.1 standard and is accessed by the 
emulator. To communicate with the emulator, your target system must have a 
14-pin header (two rows of seven pins) with the connections that are shown 
in Figure 11-1. Table 11-1 describes the emulation signals. 


Figure 11-1. 14-Pin Header Signals and Header Dimensions 


Signal      Pin   Pin   Signal 
TMS          1     2    TRST 
TDI          3     4    GND 
PD (Vcc)     5     6    no pin (key)† 
TDO          7     8    GND 
TCK_RET      9    10    GND 
TCK         11    12    GND 
EMU0        13    14    EMU1 

Header dimensions: 
Pin-to-pin spacing, 0.100 in. (X,Y) 
Pin width, 0.025-in. square post 
Pin length, 0.235-in. nominal 


† While the corresponding female position on the cable connector is plugged to prevent improper 
connection, the cable lead for pin 6 is present in the cable and is grounded, as shown in the 
schematics and wiring diagrams in this document. 


Table 11-1. 14-Pin Header Signal Descriptions 


Signal     Description                                          Emulator†   Target† 
                                                                State       State 

TMS        Test mode select                                     O           I 
TDI        Test data input                                      O           I 
TDO        Test data output                                     I           O 
TCK        Test clock. TCK is a 10.368-MHz clock source         O           I 
           from the emulation cable pod. This signal can 
           be used to drive the system test clock. 
TRST‡      Test reset                                           O           I 
EMU0       Emulation pin 0                                      I           I/O 
EMU1       Emulation pin 1                                      I           I/O 
PD(Vcc)    Presence detect. Indicates that the emulation        I           O 
           cable is connected and that the target is 
           powered up. PD should be tied to Vcc in the 
           target system. 
TCK_RET    Test clock return. Test clock input to the           I           O 
           emulator. May be a buffered or unbuffered 
           version of TCK. 
GND        Ground 


† I = input; O = output 

‡ Do not use pullup resistors on TRST: it has an internal pulldown device. In a low-noise 
environment, TRST can be left floating. In a high-noise environment, an additional pulldown 
resistor may be needed. (The size of this resistor should be based on electrical current 
considerations.) 


Although you can use other headers, recommended parts include: 


straight header, unshrouded     DuPont Connector Systems 
                                part numbers: 65610-114 
                                              65611-114 
                                              67996-114 
                                              67997-114 


11.2 Bus Protocol 


The IEEE 1149.1 specification covers the requirements for the test access port 
(TAP) bus slave devices and provides certain rules, summarized as follows: 


❏ The TMS/TDI inputs are sampled on the rising edge of the TCK signal of 
   the device. 


❏ The TDO output is clocked from the falling edge of the TCK signal of the 
   device. 


When these devices are daisy-chained together, the TDO of one device has 
approximately a half TCK cycle setup to the next device’s TDI signal. This type 
of timing scheme minimizes race conditions that would occur if both TDO and 
TDI were timed from the same TCK edge. The penalty for this timing scheme 
is a reduced TCK frequency. 


The IEEE 1149.1 specification does not provide rules for bus master (emula- 
tor) devices. Instead, it states that it expects a bus master to provide bus slave 
compatible timings. The XDS510 provides timings that meet the bus slave 
rules. 


11.3 IEEE 1149.1 Standard 


For more information concerning the IEEE 1149.1 standard, contact IEEE 
Customer Service: 


Address: IEEE Customer Service 
445 Hoes Lane, PO Box 1331 
Piscataway, NJ 08855-1331 


Phone: (800) 678-IEEE in the US and Canada 
(908) 981-1393 outside the US and Canada 


FAX: (908) 981-9667 Telex: 833233 


11.4 JTAG Emulator Cable Pod Logic 


Figure 11-2 shows a portion of the emulator cable pod. These are the 
functional features of the pod: 


❏ Signals TDO and TCK_RET can be parallel-terminated inside the pod if 
   required by the application. By default, these signals are not terminated. 


❏ Signal TCK is driven with a 74LVT240 device. Because of the high-current 
   drive (32 mA IOL/IOH), this signal can be parallel-terminated. If TCK is 
   tied to TCK_RET, then you can use the parallel terminator in the pod. 


❏ Signals TMS and TDI can be generated from the falling edge of TCK_RET, 
   according to the IEEE 1149.1 bus slave device timing rules. 


❏ Signals TMS and TDI are series-terminated to reduce signal reflections. 


❏ A 10.368-MHz test clock source is provided. You may also provide your 
   own test clock for greater flexibility. 


Figure 11-2. JTAG Emulator Cable Pod Interface 

[Figure: pod schematic including a 74F175. Header signals shown: TMS (pin 1), 
TRST (pin 2), TDI (pin 3), GND (pins 4, 6, 8, 10, 12), PD(Vcc) (pin 5), 
TDO (pin 7), TCK_RET (pin 9)†, TCK (pin 11)†, EMU0 (pin 13), EMU1 (pin 14)] 


† The emulator pod uses TCK_RET as its clock source for internal synchronization. TCK is provided 
as an optional target system test clock source. 


11.5 JTAG Emulator Cable Pod Signal Timing 


Figure 11-3 shows the signal timings for the emulator cable pod. Table 11-2 
defines the timing parameters. These timing parameters are calculated from 
values specified in the standard data sheets for the emulator and cable pod 
and are for reference only. Texas Instruments does not test or guarantee these 
timings. 


The emulator pod uses TCK_RET as its clock source for internal synchroni- 
zation. TCK is provided as an optional target system test clock source. 


Figure 11-3. JTAG Emulator Cable Pod Timings 

[Figure: timing diagram of TCK_RET (referenced at 1.5 V), TMS/TDI, and TDO, 
with intervals numbered 1 through 6 as defined in Table 11-2] 

Table 11-2. Emulator Cable Pod Timing Parameters 


No.   Reference   Description                                      Min   Max   Units 

1     tc(TCK)     TCK_RET period                                   35    200   ns 
2     tw(TCKH)    TCK_RET high-pulse duration                      15          ns 
3     tw(TCKL)    TCK_RET low-pulse duration                       15          ns 
4     td(TMS)     Delay time, TMS/TDI valid from TCK_RET low       6     20    ns 
5     tsu(TDO)    TDO setup time to TCK_RET high                   3           ns 
6     th(TDO)     TDO hold time from TCK_RET high                  12          ns 


11.6 Emulation Timing Calculations 


The following examples help you calculate emulation timings in your system. 
For actual target timing parameters, see the appropriate device data sheets. 


Assumptions: 


tsu(TTMS)      Target TMS/TDI setup to TCK high                    10 ns 
td(TTDO)       Target TDO delay from TCK low                       15 ns 
td(bufmax)     Target buffer delay, maximum                        10 ns 
td(bufmin)     Target buffer delay, minimum                        1 ns 
t(bufskew)     Target buffer skew between two devices in the       1.35 ns 
               same package: [td(bufmax) - td(bufmin)] x 0.15 
t(TCKfactor)   Assume a 40/60 duty cycle clock                     0.4 (40%) 


Given in Table 11-2: 


td(TMSmax)     Emulator TMS/TDI delay from TCK_RET low, maximum    20 ns 
tsu(TDOmin)    TDO setup time to emulator TCK_RET high, minimum    3 ns 


There are two key timing paths to consider in the emulation design: 


❏ The TCK_RET-to-TMS/TDI path, called tpd(TCK_RET-TMS/TDI) 
❏ The TCK_RET-to-TDO path, called tpd(TCK_RET-TDO) 


Of the following two cases, the worst-case path delay is calculated to deter- 
mine the maximum system test clock frequency. 


Case 1: Single processor, direct connection, TMS/TDI timed from TCK_RET low. 


tpd(TCK_RET-TMS/TDI) = [td(TMSmax) + tsu(TTMS)] / t(TCKfactor) 
                     = [20 ns + 10 ns] / 0.4 
                     = 75 ns (13.3 MHz) 


tpd(TCK_RET-TDO)     = [td(TTDO) + tsu(TDOmin)] / t(TCKfactor) 
                     = [15 ns + 3 ns] / 0.4 
                     = 45 ns (22.2 MHz) 


In this case, the TCK_RET-to-TMS/TDI path is the limiting factor. 


Emulation Timing Calculations 
Case 2: Single/multiprocessor, TMS/TDI/TCK buffered input, TDO buffered output, 
TMS/TDI timed from TCK_RET low. 


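$$
t_{pd(\text{TCK\_RET-TMS/TDI})}
  = \frac{t_{d(\text{TMSmax})} + t_{su(\text{TTMS})} + t_{(\text{bufskew})}}{t_{(\text{TCKfactor})}}
  = \frac{20\ \text{ns} + 10\ \text{ns} + 1.35\ \text{ns}}{0.4}
  = 78.4\ \text{ns}\ (12.7\ \text{MHz})
$$

$$
t_{pd(\text{TCK\_RET-TDO})}
  = \frac{t_{d(\text{TTDO})} + t_{su(\text{TDOmin})} + t_{d(\text{bufmax})}}{t_{(\text{TCKfactor})}}
  = \frac{15\ \text{ns} + 3\ \text{ns} + 10\ \text{ns}}{0.4}
  = 70\ \text{ns}\ (14.3\ \text{MHz})
$$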


In this case, the TCK_RET-to-TMS/TDI path is the limiting factor. 
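
The two cases can be checked numerically. The following is a minimal Python sketch, not part of the emulator software; it uses only the assumed values listed at the start of this section, and the function and variable names are illustrative. Each path delay is divided by the TCK factor because the delay must fit in the short (40%) phase of the clock.

```python
# Assumed values from Section 11.6, all in nanoseconds.
T_SU_TTMS = 10.0    # target TMS/TDI setup to TCK high
T_D_TTDO = 15.0     # target TDO delay from TCK low
T_D_BUFMAX = 10.0   # target buffer delay, maximum
T_D_BUFMIN = 1.0    # target buffer delay, minimum
T_D_TMSMAX = 20.0   # emulator TMS/TDI delay from TCK_RET low, maximum
T_SU_TDOMIN = 3.0   # TDO setup to emulator TCK_RET high, minimum
TCK_FACTOR = 0.4    # 40/60 duty cycle clock


def buffer_skew_ns(bufmax_ns, bufmin_ns, fraction=0.15):
    """Skew between two buffers in the same package."""
    return (bufmax_ns - bufmin_ns) * fraction


def min_tck_period_ns(delays_ns, tck_factor=TCK_FACTOR):
    """Smallest TCK period for which the summed delays fit in the short phase."""
    return sum(delays_ns) / tck_factor


def period_to_mhz(period_ns):
    """Convert a period in nanoseconds to a frequency in megahertz."""
    return 1000.0 / period_ns


def limiting_path(paths):
    """Return (name, period_ns) of the slowest path in a name -> period dict."""
    name = max(paths, key=paths.get)
    return name, paths[name]


skew = buffer_skew_ns(T_D_BUFMAX, T_D_BUFMIN)

# Case 1: single processor, direct connection.
case1 = {
    "TCK_RET-TMS/TDI": min_tck_period_ns([T_D_TMSMAX, T_SU_TTMS]),
    "TCK_RET-TDO": min_tck_period_ns([T_D_TTDO, T_SU_TDOMIN]),
}

# Case 2: buffered TMS/TDI/TCK inputs and buffered TDO output.
case2 = {
    "TCK_RET-TMS/TDI": min_tck_period_ns([T_D_TMSMAX, T_SU_TTMS, skew]),
    "TCK_RET-TDO": min_tck_period_ns([T_D_TTDO, T_SU_TDOMIN, T_D_BUFMAX]),
}

for label, paths in (("Case 1", case1), ("Case 2", case2)):
    name, period = limiting_path(paths)
    print(f"{label}: {name} limits TCK to "
          f"{period_to_mhz(period):.1f} MHz ({period:.1f} ns)")
```

Substituting the values from your own device data sheets gives the maximum test clock frequency for your system; the last printed digit may differ from the figures above because of rounding.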


In a multiprocessor application, it is necessary to ensure that the EMU0-1 lines
can go from a logic low level to a logic high level in less than 10 µs. This can be
calculated as follows:


$$
t_r = 5\,(R_{pullup} \times N_{devices} \times C_{load\_per\_device})
    = 5\,(4.7\ \text{k}\Omega \times 16 \times 15\ \text{pF})
    = 5.64\ \mu\text{s}
$$
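
The rise-time estimate is easy to repeat for other pullup values or device counts. The sketch below is illustrative Python with hypothetical function names; it applies the five-time-constant formula above and compares the result with the 10-µs limit.

```python
def emu_rise_time_us(r_pullup_ohm, n_devices, c_load_pf):
    """Rise time estimated as five RC time constants, in microseconds."""
    c_total_farad = n_devices * c_load_pf * 1e-12
    return 5.0 * r_pullup_ohm * c_total_farad * 1e6


def rise_time_ok(r_pullup_ohm, n_devices, c_load_pf, limit_us=10.0):
    """True when the estimated rise time is below the limit."""
    return emu_rise_time_us(r_pullup_ohm, n_devices, c_load_pf) < limit_us


# Values used in the text: 4.7-kilohm pullup, 16 devices, 15 pF per device.
t_rise = emu_rise_time_us(4700.0, 16, 15.0)
print(f"rise time = {t_rise:.2f} us, within limit: {rise_time_ok(4700.0, 16, 15.0)}")
```

The 15-pF load per device is the value assumed in the text; use the pin capacitance from your device data sheet plus an allowance for trace capacitance.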
11.7 Connections Between the Emulator and the Target System


It is extremely important to provide high-quality signals between the emulator 
and the JTAG target system. Depending upon the situation, you must supply 
the correct signal buffering, test clock inputs, and multiple processor intercon- 
nections to ensure proper emulator and target system operation. 


Signals applied to the EMU0 and EMU1 pins on the JTAG target device can 
be either input or output (I/O). In general, these two pins are used as both input 
and output in multiprocessor systems to handle global run/stop operations. 
EMU0 and EMU1 signals are applied only as inputs to the XDS510 emulator
header. 


11.7.1 Buffering Signals 


If the distance between the emulation header and the JTAG target device is 
greater than six inches, the emulation signals must be buffered. If the distance 
is less than six inches, no buffering is necessary. The following illustrations 
depict these two situations. 


- No signal buffering. In this situation, the distance between the header
  and the JTAG target device should be no more than six inches.


[Figure: unbuffered connection, 6 inches or less, between the JTAG device and the emulator header. Signals shown: EMU0, EMU1, TRST, TMS, TDI, TDO, TCK, TCK_RET, PD, and Vcc.]

The EMU0 and EMU1 signals must have pullup resistors connected to Vcc to
provide a signal rise time of less than 10 µs. A 4.7-kΩ resistor is suggested for
most applications.




- Buffered transmission signals. In this situation, the distance between
  the emulation header and the processor is greater than six inches. Emulation
  signals TMS, TDI, TDO, and TCK_RET are buffered through the same
  package.


[Figure: buffered connection, greater than 6 inches, between the JTAG device and the emulator header. Signals shown: EMU0, EMU1, TRST, TMS, TDI, TDO, TCK, TCK_RET, PD, Vcc, and GND.]


    - The EMU0 and EMU1 signals must have pullup resistors connected to
      Vcc to provide a signal rise time of less than 10 µs. A 4.7-kΩ resistor is
      suggested for most applications.


    - The input buffers for TMS and TDI should have pullup resistors connected
      to Vcc to hold these signals at a known value when the emulator
      is not connected. A resistor value of 4.7 kΩ or greater is suggested.


    - To have high-quality signals (especially the processor TCK and the
      emulator TCK_RET signals), you may have to employ special care
      when routing the PWB trace. You also may have to use termination
      resistors to match the trace impedance. The emulator pod provides
      optional internal parallel terminators on TCK_RET and TDO. TMS
      and TDI provide fixed series termination.


    - Since TRST is an asynchronous signal, it should be buffered as
      needed to ensure sufficient current to all target devices.
11.7.2 Using a Target-System Clock


Figure 11-4 shows an application with the system test clock generated in the 
target system. In this application, the TCK signal is left unconnected. 


Figure 11-4. Target-System-Generated Test Clock 


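[Figure: buffered connection, greater than 6 inches, between the JTAG device and the emulator header, with a system test clock generated in the target system. The emulator header TCK pin is marked NC. Signals shown: EMU0, EMU1, TRST, TMS, TDI, TDO, TCK, TCK_RET, PD, Vcc, and GND.]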


Note: When the TMS/TDI lines are buffered, pullup resistors should be used to hold the buffer
inputs at a known level when the emulator cable is not connected.


There are two benefits to having the target system generate the test clock: 


- The emulator provides only a single 10.368-MHz test clock. If you allow
  the target system to generate your test clock, you can set the frequency
  to match your system requirements.


- In some cases, you may have other devices in your system that require
  a test clock when the emulator is not connected. The system test clock
  also serves this purpose.




11.7.3 Configuring Multiple Processors 


Figure 11-5 shows a typical daisy-chained multiprocessor configuration, 
which meets the minimum requirements of the IEEE 1149.1 specification. The 
emulation signals in this example are buffered to isolate the processors from 
the emulator and provide adequate signal drive for the target system. One of 
the benefits of this type of interface is that you can generally slow down the test 
clock to eliminate timing problems. You should follow these guidelines for 
multiprocessor support: 


- The processor TMS, TDI, TDO, and TCK signals should be buffered
  through the same physical package for better control of timing skew.


- The input buffers for TMS, TDI, and TCK should have pullup resistors connected
  to Vcc to hold these signals at a known value when the emulator
  is not connected. A resistor value of 4.7 kΩ or greater is suggested.


- Buffering EMU0 and EMU1 is optional but highly recommended to provide
  isolation. These are not critical signals and do not have to be buffered
  through the same physical package as TMS, TCK, TDI, and TDO. Unbuffered
  and buffered signals are shown in Section 11.7.1, Buffering Signals.


Figure 11-5. Multiprocessor Connections 


[Figure: two daisy-chained JTAG devices connected to the emulator header. Signals shown: EMU0, EMU1, TRST, TMS, TDI, TDO, TCK, TCK_RET, PD, Vcc, and GND.]
11.8 Mechanical Dimensions for the 14-Pin Emulator Connector


The JTAG emulator target cable consists of a 3-foot section of jacketed cable, 
an active cable pod, and a short section of jacketed cable that connects to the 
target system. The overall cable length is approximately 3 feet 10 inches. 
Figure 11-6 and Figure 11-7 (page 11-13) show the mechanical dimensions 
for the target cable pod and short cable. Note that the pin-to-pin spacing on 
the connector is 0.100 inches in both the X and Y planes. The cable pod box 
is nonconductive plastic with four recessed metal screws. 


Figure 11-6. Pod/Connector Dimensions 


[Figure: emulator cable pod (dimension shown: 2.70), short jacketed cable, and connector. Refer to Figure 11-7 for the connector.]


All dimensions are in inches and are nominal dimensions, unless otherwise specified. 




Figure 11-7. 14-Pin Connector Dimensions 


[Figure: connector side view and front view, each with the cable attached (dimension shown: 0.87). The front view shows two pin rows: pins 1, 3, 5, 7, 9, 11, 13 and pins 2, 4, 6, 8, 10, 12, 14.]


Note: All dimensions are in inches and are nominal dimensions, unless otherwise specified. 
11.9 Emulation Design Considerations


This section describes the scan path linker (SPL), which can simultaneously 
add all four secondary JTAG scan paths to the main scan path. It also de- 
scribes how to use the emulation pins and configure multiple processors. 


11.9.1 Using Scan Path Linkers 

You can use the TI ACT8997 scan path linker (SPL) to divide the JTAG
emulation scan path into smaller, logically connected groups of 4 to 16
devices. As described in the Advanced Logic and Bus Interface Logic Data
Book (literature number SCYD001), the SPL is compatible with the JTAG
emulation scanning. The SPL is capable of adding any combination of its four
secondary scan paths into the main scan path.


A system of multiple, secondary JTAG scan paths has better fault tolerance 
and isolation than a single scan path. Since an SPL has the capability of adding 
all secondary scan paths to the main scan path simultaneously, it can support 
global emulation operations, such as starting or stopping a selected group of 
processors. 


TI emulators do not support the nesting of SPLs (for example, an SPL
connected to the secondary scan path of another SPL). However, you can 
have multiple SPLs on the main scan path. 


Although the ACT8999 scan path selector is similar to the SPL, it can add only 
one of its secondary scan paths at a time to the main JTAG scan path. Thus, 
global emulation operations are not assured with the scan path selector. For 
this reason, scan path selectors are not supported. 


You can insert an SPL on a backplane so that you can add up to four device 
boards to the system without the jumper wiring required with nonbackplane 
devices. You connect an SPL to the main JTAG scan path in the same way you 
connect any other device. Figure 11-8 shows you how to connect a secondary
scan path to an SPL. 




Figure 11-8. Connecting a Secondary JTAG Scan Path to an SPL 


[Figure: SPL secondary scan path labeled JTAG 0. Signals shown: TDI, TMS, TCK, TRST, and TDO.]


The TRST signal from the main scan path drives all devices, even those on 
the secondary scan paths of the SPL. The TCK signal on each target device 
on the secondary scan path of an SPL is driven by the SPL’s DTCK signal. The
TMS signal on each device on the secondary scan path is driven by the respec- 
tive DTMS signals on the SPL. 


DTDO on the SPL is connected to the TDI signal of the first device on the sec- 
ondary scan path. DTDI on the SPL is connected to the TDO signal of the last 
device in the secondary scan path. Within each secondary scan path, the TDI 
signal of a device is connected to the TDO signal of the device before it. If the 
SPL is on a backplane, its secondary JTAG scan paths are on add-on boards; 
if signal degradation is a problem, you may need to buffer both the TRST and 
DTCK signals. Although less likely, you may also need to buffer the DTMSn 
signals for the same reasons. 
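
The wiring rules above can be written out as a connection list. The following Python sketch is only an illustration: the SPL pin names (DTDO, DTDI, DTCK, DTMSn) follow the text, while the function name, the device labels, and the "device.pin" net format are hypothetical.

```python
def secondary_scan_connections(devices, path_number, spl="SPL", main="MAIN"):
    """List (driver, receiver) pairs for one secondary scan path of an SPL.

    devices      -- device labels in scan order (first device listed first)
    path_number  -- which DTMSn signal drives this secondary path
    """
    if not devices:
        raise ValueError("a secondary scan path needs at least one device")

    connections = []

    # Data chain: SPL DTDO -> first TDI, each TDO -> next TDI, last TDO -> SPL DTDI.
    connections.append((f"{spl}.DTDO", f"{devices[0]}.TDI"))
    for previous, current in zip(devices, devices[1:]):
        connections.append((f"{previous}.TDO", f"{current}.TDI"))
    connections.append((f"{devices[-1]}.TDO", f"{spl}.DTDI"))

    # Control signals: DTCK and DTMSn come from the SPL, TRST from the main path.
    for device in devices:
        connections.append((f"{spl}.DTCK", f"{device}.TCK"))
        connections.append((f"{spl}.DTMS{path_number}", f"{device}.TMS"))
        connections.append((f"{main}.TRST", f"{device}.TRST"))

    return connections


for driver, receiver in secondary_scan_connections(["DSP0", "DSP1"], 1):
    print(f"{driver:12s} -> {receiver}")
```

A list like this can be compared against a board netlist to catch a reversed TDI/TDO pair before the board is built.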
11.9.2 Emulation Timing Calculations for SPL


The following examples help you to calculate the emulation timings in the SPL 
secondary scan path of your system. For actual target timing parameters, see 
the appropriate device data sheets. 


Assumptions: 


tsu(TTMS)      Target TMS/TDI setup to TCK high                    10 ns
td(TTDO)       Target TDO delay from TCK low                       15 ns
td(bufmax)     Target buffer delay, maximum                        10 ns
td(bufmin)     Target buffer delay, minimum                        1 ns
t(bufskew)     Target buffer skew between two devices in the
               same package: [td(bufmax) - td(bufmin)] x 0.15      1.35 ns
t(TCKfactor)   Assume a 40/60 duty cycle clock                     0.4

Given in the SPL data sheet:

td(DTMSmax)    SPL DTMS/DTDO delay from TCK low, maximum           31 ns
tsu(DTDImin)   DTDI setup time to SPL TCK high, minimum            7 ns
td(DTCKHmin)   SPL DTCK delay from TCK high, minimum               2 ns
td(DTCKLmax)   SPL DTCK delay from TCK low, maximum                16 ns


There are two key timing paths to consider in the emulation design: 


- The TCK-to-DTMS/DTDO path, called tpd(TCK-DTMS)
- The TCK-to-DTDI path, called tpd(TCK-DTDI)


Of the following two cases, the worst-case path delay is calculated to determine
the maximum system test clock frequency.


Case 1: Single processor, direct connection, DTMS/DTDO timed from TCK low. 
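$$
t_{pd(\text{TCK-DTMS})}
  = \frac{t_{d(\text{DTMSmax})} + t_{d(\text{DTCKHmin})} + t_{su(\text{TTMS})}}{t_{(\text{TCKfactor})}}
  = \frac{31\ \text{ns} + 2\ \text{ns} + 10\ \text{ns}}{0.4}
  = 107.5\ \text{ns}\ (9.3\ \text{MHz})
$$

$$
t_{pd(\text{TCK-DTDI})}
  = \frac{t_{d(\text{TTDO})} + t_{d(\text{DTCKLmax})} + t_{su(\text{DTDImin})}}{t_{(\text{TCKfactor})}}
  = \frac{15\ \text{ns} + 16\ \text{ns} + 7\ \text{ns}}{0.4}
  = 95\ \text{ns}\ (10.5\ \text{MHz})
$$


In this case, the TCK-to-DTMS/DTDO path is the limiting factor.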


Case 2: Single/multiprocessor, DTMS/DTDO/TCK buffered input, DTDI buffered out- 
put, DTMS/DTDO timed from TCK low. 


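$$
t_{pd(\text{TCK-DTMS})}
  = \frac{t_{d(\text{DTMSmax})} + t_{d(\text{DTCKHmin})} + t_{su(\text{TTMS})} + t_{(\text{bufskew})}}{t_{(\text{TCKfactor})}}
  = \frac{31\ \text{ns} + 2\ \text{ns} + 10\ \text{ns} + 1.35\ \text{ns}}{0.4}
  = 110.9\ \text{ns}\ (9.0\ \text{MHz})
$$

$$
t_{pd(\text{TCK-DTDI})}
  = \frac{t_{d(\text{TTDO})} + t_{d(\text{DTCKLmax})} + t_{su(\text{DTDImin})} + t_{d(\text{bufmax})}}{t_{(\text{TCKfactor})}}
  = \frac{15\ \text{ns} + 16\ \text{ns} + 7\ \text{ns} + 10\ \text{ns}}{0.4}
  = 120\ \text{ns}\ (8.3\ \text{MHz})
$$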


In this case, the TCK-to-DTDI path is the limiting factor. 
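
The SPL cases can be verified the same way. This Python sketch is illustrative only; the dictionary keys and function names are hypothetical, and the values are the assumptions listed at the start of this section, with td(DTCKLmax) taken as 16 ns in both cases so that the buffered TCK-to-DTDI path gives the stated 120 ns.

```python
TCK_FACTOR = 0.4  # 40/60 duty cycle clock

# Target assumptions, in nanoseconds.
TARGET = {"tsu_ttms": 10.0, "td_ttdo": 15.0, "td_bufmax": 10.0, "td_bufmin": 1.0}

# SPL data sheet values quoted in the text, in nanoseconds.
SPL = {"td_dtmsmax": 31.0, "tsu_dtdimin": 7.0, "td_dtckhmin": 2.0, "td_dtcklmax": 16.0}


def min_period_ns(*delays_ns):
    """Smallest TCK period for which the summed delays fit in the short phase."""
    return sum(delays_ns) / TCK_FACTOR


def spl_path_periods(buffered):
    """Return the two path periods (ns) for the direct or the buffered case."""
    skew = (TARGET["td_bufmax"] - TARGET["td_bufmin"]) * 0.15
    dtms = [SPL["td_dtmsmax"], SPL["td_dtckhmin"], TARGET["tsu_ttms"]]
    dtdi = [TARGET["td_ttdo"], SPL["td_dtcklmax"], SPL["tsu_dtdimin"]]
    if buffered:
        dtms.append(skew)                 # skew between buffers in one package
        dtdi.append(TARGET["td_bufmax"])  # buffer delay on the return path
    return {"TCK-DTMS": min_period_ns(*dtms), "TCK-DTDI": min_period_ns(*dtdi)}


for label, buffered in (("Case 1", False), ("Case 2", True)):
    periods = spl_path_periods(buffered)
    worst = max(periods, key=periods.get)
    print(f"{label}: {worst} is limiting at {periods[worst]:.1f} ns "
          f"({1000.0 / periods[worst]:.1f} MHz)")
```

Note that the limiting path changes between the two cases, so both paths should be recalculated whenever buffers are added or removed.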
11.9.3 Using Emulation Pins


The EMU0/1 pins of TI devices are bidirectional, three-state output pins. When
in an inactive state, these pins are at high impedance. When the pins are 
active, they function in one of the two following output modes: 


- Signal Event
  The EMU0/1 pins can be configured via software to signal internal events.
  In this mode, driving one of these pins low can cause devices to signal
  such events. To enable this operation, the EMU0/1 pins function as
  open-collector sources. External devices such as logic analyzers can also be
  connected to the EMU0/1 signals in this manner. If such an external
  source is used, it must also be connected via an open-collector source.


- External Count
  The EMU0/1 pins can be configured via software as totem-pole outputs
  for driving an external counter. These devices can be damaged if the output
  of more than one device is configured for totem-pole operation. The
  emulation software detects and prevents this condition. However, the
  emulation software has no control over external sources on the EMU0/1
  signal. Therefore, all external sources must be inactive when any device
  is in the external count mode.


TI devices can be configured by software to halt processing if their EMU0/1
pins are driven low. This feature, in combination with the use of the signal event
output mode, allows one TI device to halt all other TI devices on a given event
for system-level debugging.


If you route the EMU0/1 signals between boards, they require special handling 
because these signals are more complex than normal emulation signals. 
Figure 11-9 shows an example configuration that allows any processor in the
system to stop any other processor in the system. Do not tie the EMU0/1 pins 
of more than 16 processors together in a single group without using buffers. 
Buffers provide the crisp signals that are required during a RUNB (run bench- 
mark) debugger command or when the external analysis counter feature is 
used. 

Figure 11-9. EMU0/1 Configuration


Notes: 1) The low time on EMUx-IN should be at least one TCK cycle and less than 10 µs. Software will set the EMUx-OUT
pin to a high state.


2) To enable the open-collector driver and pullup resistor on EMU1 to provide rising/falling edges of less than 25 ns,
the modification shown in this figure is suggested. Rising edges slower than 25 ns can cause the emulator to detect
false edges during the RUNB command or when the external counter selected from the debugger analysis menu
is used.


These seven important points apply to the circuitry shown in Figure 11-9 and 
the timing shown in Figure 11-10: 


- Open-collector drivers isolate each board. The EMU0/1 pins are tied
  together on each board.


- At the board edge, the EMU0/1 signals are split to provide IN/OUT. This
  is required to prevent the open-collector drivers from acting as a latch that
  can be set only once.


- The EMU0/1 signals are bused down the backplane. Pullup resistors are
  installed as required.


- The bused EMU0/1 signals go into a PAL® device whose function is to
  generate a low pulse on the EMU0/1-IN signal when a low level is detected
  on the EMU0/1-OUT signal. This pulse must be longer than one TCK
  period to affect the devices, but less than 10 µs to avoid possible conflicts
  or retriggering once the emulation software clears the device’s pins.


- During a RUNB debugger command or other external analysis count, the
  EMU0/1 pins on the target device become totem-pole outputs. The EMU1
  pin is a ripple carry-out of the internal counter. EMU0 becomes a
  processor-halted signal. During a RUNB or other external analysis count,
  the EMU0/1-IN signal to all boards must remain in the high (disabled)
  state. You must provide some type of external input (XCNT_ENABLE) to
  the PAL to disable the PAL from driving EMU0/1-IN to a low state.


- If sources other than TI processors (such as logic analyzers) are used to
  drive EMU0/1, their signal lines must be isolated by open-collector drivers
  and be inactive during RUNB and other external analysis counts.


- You must connect the EMU0/1-OUT signals to the emulation header or
  directly to a test bus controller.
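
The pulse-width window for the PAL-generated EMU0/1-IN pulse depends on the test clock in use. The following Python sketch, with a hypothetical function name, checks a candidate pulse width against the two limits stated above: longer than one TCK period and shorter than 10 µs.

```python
def emu_in_pulse_ok(pulse_ns, tck_mhz, max_pulse_us=10.0):
    """True when the low pulse is longer than one TCK period and under the limit."""
    tck_period_ns = 1000.0 / tck_mhz
    return tck_period_ns < pulse_ns < max_pulse_us * 1000.0


# With the emulator's 10.368-MHz test clock, one TCK period is about 96.5 ns.
for width_ns in (50.0, 200.0, 12000.0):
    print(f"{width_ns:8.1f} ns pulse acceptable: {emu_in_pulse_ok(width_ns, 10.368)}")
```

If the target system supplies a slower test clock, the lower bound rises accordingly, which narrows the usable window.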

Figure 11-10. Suggested Timings for the EMU0 and EMU1 Signals


Figure 11-11. EMU0/1 Configuration With Additional AND Gate to Meet Timing


Notes: 1) The low time on EMUx-IN should be at least one TCK cycle and less than 10 µs. Software will set the EMUx-OUT
pin to a high state.

2) To enable the open-collector driver and pullup resistor on EMU1 to provide rising/falling edges of less than 25 ns,
the modification shown in this figure is suggested. Rising edges slower than 25 ns can cause the emulator to detect
false edges during the RUNB command or when the external counter selected from the debugger analysis menu
is used.




If it is not important that the devices on one target board are stopped by devices
on another target board via the EMU0/1 pins, then the circuit in Figure 11-12 can be
used. In this configuration, the global-stop capability is lost. It is important not
to overload EMU0/1 with more than 16 devices.


Figure 11-12. EMU0/1 Configuration Without Global Stop


Note: The open-collector driver and pullup resistor on EMU1 must be able to provide rising/falling edges of less than 25 ns.
Rising edges slower than 25 ns can cause the emulator to detect false edges during the RUNB command or when the
external counter selected from the debugger analysis menu is used. If this condition cannot be met, then the EMU0/1
signals from the individual boards should be ANDed together (as shown in Figure 11-11) to produce an EMU0/1 signal for
the emulator.
11.9.4 Performing Diagnostic Applications


For systems that require built-in diagnostics, it is possible to connect the
emulation scan path directly to a TI ACT8990 test bus controller (TBC) instead
of the emulation header. The TBC is described in the Texas Instruments
Advanced Logic and Bus Interface Logic Data Book (literature number
SCYD001). Figure 11-13 shows the scan path connections of n devices to the
TBC.


Figure 11-13. TBC Emulation Connections for n JTAG Scan Paths 

In the system design shown in Figure 11-13, the TBC emulation signals TCKI,
TDO, TMS0, TMS2/EVNT0, TMS3/EVNT1, TMS5/EVNT3, TCKO, and TDI0
are used, and TMS1, TMS4/EVNT2, and TDI1 are not connected. The target
devices’ EMU0 and EMU1 signals are connected to Vcc through pullup resistors
and tied to the TBC’s TMS2/EVNT0 and TMS3/EVNT1 pins, respectively.
The TBC’s TCKI pin is connected to a clock generator. The TCK signal for the
main JTAG scan path is driven by the TBC’s TCKO pin.


On the TBC, the TMS0 pin drives the TMS pins on each device on the main
JTAG scan path. TDO on the TBC connects to TDI on the first device on the
main JTAG scan path. TDI0 on the TBC is connected to the TDO signal of the
last device on the main JTAG scan path. Within the main JTAG scan path, the
TDI signal of a device is connected to the TDO signal of the device before it.
TRST for the devices can be generated either by inverting the TBC’s
TMS5/EVNT3 signal for software control or by logic on the board itself.
Glossary


A0-A30: External address pins for data/program memory or I/O devices.
These pins are on the global bus. See also LA0-LA30.


address: The location of program code or data stored in memory. 


addressing mode: The method by which an instruction interprets its oper- 
ands to acquire the data it needs. 


ALU: See Arithmetic logic unit. 


analog-to-digital (A/D) converter: A successive-approximation converter 
with internal sample-and-hold circuitry used to translate an analog signal 
to a digital signal. 


ARAU: See auxiliary register arithmetic unit. 


arithmetic logic unit (ALU): The part of the CPU that performs arithmetic 
and logic operations. 


auxiliary registers (ARn): A set of registers used primarily in address gen- 
eration. 


auxiliary register arithmetic unit (ARAU): A 16-bit arithmetic logic unit
(ALU) used to calculate indirect addresses using the auxiliary registers
as inputs and outputs.


bit-reversed addressing: Addressing in which several bits of an address 
are reversed in order to speed processing of algorithms, such as Fourier 
transforms. 


BK: See block-size register. 


block-size register: A register used for defining the length of a program 
block to be repeated in repeat mode. 


bootloader: A built-in segment of code that transfers code from an external 
memory or from a communication port to RAM at power-up. 


carry bit: A bit in status register ST1 used by the ALU for extended arithmetic
operations and accumulator shifts and rotates. The carry bit can be
tested by conditional instructions.


circular addressing: An addressing mode in which an auxiliary register is 
used to cycle through a range of addresses to create a circular buffer in 
memory. 


context save/restore: A save/restore of system status (status registers,
accumulator, product register, temporary register, hardware stack, and
auxiliary registers, etc.) when the device enters/exits a subroutine such
as an interrupt service routine.


CPU: Central processing unit. The unit that coordinates the functions of a 
processor. 


CPU cycle: The time it takes the CPU to go through one logic phase (during
which internal values are changed) and one latch phase (during which 
the values are held constant). 


cycle: See CPU cycle. 


D0-D31: External data bus pins that transfer data between the processor
and external data/program memory or I/O devices. See also LD0-LD31.


data-address generation logic: Logic circuitry that generates the address- 
es for data memory reads and writes. This circuitry can generate one ad- 
dress per machine cycle. See also program-address generation logic. 


data-page pointer: A seven-bit register used as the seven MSBs in ad- 
dresses generated using direct addressing. 


decode phase: The phase of the pipeline in which the instruction is de- 
coded. 


DIE: See DMA interrupt enable register. 




DMA coprocessor: A peripheral that transfers the contents of memory locations
independently of the processor (except for initialization).


DMA controller: See DMA coprocessor.


DMA interrupt enable register (DIE): A register (in the CPU register file) 
that controls which interrupts the DMA coprocessor responds to. 


DP: See data-page pointer. 


dual-access RAM: Memory that can be accessed twice in a single clock 
cycle. For example, your code can read from and write to a dual-access 
RAM in one clock cycle. 


external interrupt: A hardware interrupt triggered by a pin. 


extended-precision floating-point format: A 40-bit representation of a 
floating-point number with a 32-bit mantissa and an 8-bit exponent. 


extended-precision register: A 40-bit register used primarily for extended- 
precision floating-point calculations. Floating-point operations use bits 
39-0 of an extended-precision register. Integer operations, however, use 
only bits 31-0. 


FIFO buffer: First-in, first-out buffer. A portion of memory in which data is
stored and then retrieved in the same order in which it was stored. Thus,
the first word stored in this buffer is retrieved first. The ’C4x’s communication
ports each have two FIFOs: one for transmit operations and one for
receive operations.


hardware interrupt: An interrupt triggered through physical connections 
with on-chip peripherals or external devices. 


hit: A condition in which, when the processor fetches an instruction, the 
instruction is available in the cache. 

IACK: Interrupt acknowledge signal. An output signal that indicates that an
interrupt has been received and that the program counter is fetching the
interrupt vector that will force the processor into an interrupt service routine.


IIE: See internal interrupt enable register. 
IIF: See IIOF flag register. 


IIOF flag register (IIF): Controls the function (general-purpose I/O or interrupt)
of the four external pins (IIOF0 to IIOF3). It also contains timer/DMA
interrupt flags.


index registers: Two registers (IR0 and IR1) that are used by the ARAU for
indexing an address.


internal interrupt: A hardware interrupt caused by an on-chip peripheral. 


internal interrupt enable register: A register (in the CPU register file) that 
determines whether or not the CPU will respond to interrupts from the 
communication ports, the timers, and the DMA coprocessor. 


interrupt: A signal sent to the CPU that (when not masked) forces the CPU 
into a subroutine called an interrupt service routine. This signal can be 
triggered by an external device, an on-chip peripheral, or an instruction 
(TRAP, for example). 


interrupt acknowledge (IACK): A signal that indicates that an interrupt has 
been received, and that the program counter is fetching the interrupt vec- 
tor location. 


interrupt vector table (IVT): An ordered list of addresses which each corre- 
spond to an interrupt; when an interrupt occurs and is enabled, the pro- 
cessor executes a branch to the address stored in the corresponding 
location in the interrupt vector table. 


interrupt vector table pointer (IVTP): A register (in the CPU expansion 
register file) that contains the address of the beginning of the interrupt 
vector table. 


ISR: Interrupt service routine. A module of code that is executed in
response to a hardware or software interrupt.


IVTP: See interrupt vector table pointer. 




LA0-LA30: External address pins for data/program memory or I/O devices.
These pins are on the local bus. See also A0-A30.


LD0-LD31: External data-bus pins that transfer data between the processor
and external data/program memory or I/O devices. See also D0-D31.


LSB: Least significant bit. The lowest order bit in a word. 


machine cycle: See CPU cycle. 


mantissa: A component of a floating-point number consisting of a fraction 
and a sign bit. The mantissa represents a normalized fraction whose 
binary point is shifted by the exponent. 


maskable interrupt: A hardware interrupt that can be enabled or disabled 
through software. 


memory-mapped register: One of the on-chip registers mapped to ad- 
dresses in memory. Some of the memory-mapped registers are mapped 
to data memory, and some are mapped to input/output memory. 


MFLOPS: Millions of floating-point operations per second. A measure of
floating-point processor speed that counts the number of floating-point
operations made per second.


microcomputer mode: A mode in which the on-chip ROM is enabled. This 
mode is selected via the MP/MC pin. See also MP/MC pin; microproces- 
sor mode. 


microprocessor mode: A mode in which the on-chip ROM is disabled. This
mode is selected via the MP/MC pin. See also MP/MC pin; microcomputer
mode.


MIPS: Million instructions-per-second. 


miss: A condition in which, when the processor fetches an instruction, it is 
not available in the cache. 


MSB: Most significant bit. The highest order bit in a word. 


multiplier: A device that generates the product of two numbers. 


NMI: See Nonmaskable interrupt. 

nonmaskable interrupt (NMI): A hardware interrupt that uses the same
logic as the maskable interrupts, but cannot be masked. It is often used 
as a soft reset. 


overflow flag (OV) bit: A status bit that indicates whether or not an arithmetic
operation has exceeded the capacity of the corresponding register.


PC: See program counter. 


peripheral bus: A bus that the CPU uses to communicate with the DMA
coprocessor, communication ports, and timers.


pipeline: A method of executing instructions in an assembly-line fashion. 


program counter: A register that contains the address of the next instruc- 
tion to be fetched. 


RC: See repeat counter register. 


read/write (R/W) pin: This memory-control signal indicates the direction of 
transfer when communicating to an external device. 


register file: A bank of registers. 


repeat counter register: A register (in the CPU register file) that specifies 
the number of times minus one that a block of code is to be repeated 
when a block repeat is performed. 


repeat mode: A zero-overhead method for repeating the execution of a 
block of code. 


reset: A means to bring the central processing unit (CPU) to a known state 
by setting the registers and control bits to predetermined values and 
signaling execution to fetch the reset vector. 


reset pin: This pin causes the device to reset. 


ROMEN: ROM enable. An external pin that determines whether or not the
on-chip ROM is enabled.




RW: See read/write pin. 


short-floating-point format: A 16-bit representation of a floating-point 
number with a 12-bit mantissa and a 4-bit exponent. 


short-integer format: A twos-complement 16-bit format for integer data.


short-unsigned-integer format: A 16-bit unsigned format for integer data.


sign extend: Fill the high order bits of a number with the sign bit.


single-access RAM: SARAM. Memory that can be read from or written to 
only once in a single CPU cycle. 


single-precision floating-point format: A 32-bit representation of a float- 
ing point number with a 24-bit mantissa and an 8-bit exponent. 


single-precision integer format: A twos-complement 32-bit format for in- 
teger data. 


single-precision unsigned-integer format: A 32-bit unsigned format for 
integer data. 


software interrupt: An interrupt caused by the execution of a TRAP instruction.


split mode: A mode of operation of the DMA coprocessor. This mode allows
one DMA channel to service both the receive and transmit portions of a
communication port.


ST: See status register. 


stack: A block of memory reserved for storing and retrieving data on a first-in
last-out basis. It is usually used for storing return addresses and for
preserving register values.


status register: A register (in the CPU register file) that contains global in- 
formation related to the CPU. 


Timer: A programmable peripheral that can be used to generate pulses or 
to time events. 


Timer-Period Register: Timer-period register. A 32-bit memory-mapped 
register that specifies the period for the on-chip timer. 
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trap vector table (TVT): An ordered list of addresses which each corre- 
spond to an interrupt; when a trap is executed, the processor executes 
a branch to the address stored in the corresponding location in the trap 
vector table. 


trap vector table pointer (TVTP): Aregister (inthe CPU expansion-register 
file) that contains the address of the beginning of the trap vector table. 
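
The lookup described by the two entries above can be sketched in Python as follows. This is a simplified model: it assumes one table entry per word, so that the entry for trap N sits at address TVTP + N. The names memory and trap_target, and the addresses used, are hypothetical.

```python
def trap_target(memory: dict, tvtp: int, trap_number: int) -> int:
    """Return the branch address stored in the table entry for a trap.

    ASSUMPTION: one word per entry, so entry N is at address tvtp + N.
    """
    return memory[tvtp + trap_number]


# Hypothetical table starting at 0x2000 with two service-routine addresses
memory = {0x2000: 0x4000, 0x2001: 0x4100}
target = trap_target(memory, tvtp=0x2000, trap_number=1)
```

Because the table is located through TVTP, relocating it only requires loading a new base address into that register.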


TVTP: See trap vector table pointer. 


unified mode: A mode of operation of the DMA coprocessor. The mode is 
used mainly for memory-to-memory transfers. This is the default mode 
of operation for a DMA channel. See also split mode. 


wait state: A period of time that the CPU must wait for external program,
data, or I/O memory to respond when reading from or writing to that
external memory. The CPU waits one extra cycle for every wait state.
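
The one-cycle-per-wait-state rule gives a simple cost estimate, sketched here in Python. The function names and the base_cycles parameter (the length of a zero-wait-state access) are assumptions for illustration; actual access timing depends on the device and the memory interface settings.

```python
def access_cycles(wait_states: int, base_cycles: int = 1) -> int:
    """Cycles for one external access: base length plus one per wait state."""
    if wait_states < 0:
        raise ValueError("wait_states must be non-negative")
    return base_cycles + wait_states


def block_cycles(accesses: int, wait_states: int, base_cycles: int = 1) -> int:
    """Cycles for a run of back-to-back accesses to the same memory."""
    return accesses * access_cycles(wait_states, base_cycles)
```

Under these assumptions, 100 consecutive accesses to a two-wait-state memory cost 300 cycles, compared with 100 cycles for a zero-wait-state memory.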


wait-state generator: A program that can be modified to generate a limited 
number of wait states for a given off-chip memory space (lower program, 
upper program, data, or I/O). 


zero fill: Fill the low- or high-order bits with zeros when loading a number
into a larger field.
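
Both variants of zero fill can be illustrated with the Python sketch below. The function names are hypothetical, and the 32-bit destination width is an assumption matching the 'C4x word size.

```python
def zero_fill_high(value: int, bits: int) -> int:
    """Place a `bits`-wide field in the low end of the word; high bits are zero."""
    return value & ((1 << bits) - 1)


def zero_fill_low(value: int, bits: int, width: int = 32) -> int:
    """Place a `bits`-wide field in the high end of the word; low bits are zero."""
    field = value & ((1 << bits) - 1)
    return field << (width - bits)
```

In contrast to sign extension, zero_fill_high turns the 16-bit pattern 0x8001 into 0x00008001, so the value is treated as unsigned.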
FIFO buffer, definition A-3
filters 
adaptive 6-7 
See also adaptive filters 
digital. See digital filters 
example 6-10 
FIR 6-14 
See also FIR filters 
IIR 6-12 to 6-15 
See also IIR filters 
lattice. See lattice filter 
FIR filter 
adaptive 6-15 
benchmarks 6-8 
FIR filters 6-7, 6-14 
circular addressing 6-7 
example 6-7 
features 6-7 
FIX instruction 3-9
FLOAT instruction 3-9
floating point
conversion (to/from IEEE) 3-19
formats 3-19
IEEE 3-20
pop and push 2-8
floating-point, reciprocal 3-12
example 3-16
floating-point division 3-12 
floating-point number, inverse, example 3-14 
formats, floating point 3-19 
forward lattice filter, example 6-19 
FRIEEE instruction 3-19 
fully-connected network 8-19 


general-purpose applications vii 
GIE 2-11, 2-13 
global bus 4-3 
control signals 4-11 
global memory interface. See memory interface 
graphics/imagery applications vii, x


half-word manipulation 3-4 
hardware interrupt, definition A-3 
header 

14-pin 11-2 

dimensions 14-pin 11-2 
hexagonal grid 8-19 
hit, definition A-3 
hotline 10-3 


IACK, definition A-4 

ICFULL interrupt, example 8-2 
ICRDY communication port 7-7 
ICRDY interrupt, example 8-2 


IEEE 1149.1 specification, bus slave device 
rules 11-3 


IEEE Customer Service, address 11-3 
IEEE standard 11-3 


IIE. See internal interrupt enable register 
IIF. See IIOF flag register 
IIOF flag register (IIF) 7-5 
definition A-4 
IIR filters 6-7, 6-9, 6-9 
benchmarks 6-10, 6-12 to 6-15 
index registers, definition A-4 
initialization, boot.asm 1-9 
initialization routine 1-6 
input port 8-16 
integer division 3-9 
example 3-11 
interface, SRAM 4-8 
two strobes 4-10 
interfaces 
external. See external interfacing 
parallel processing 8-18 
shared bus 4-22 
internal interrupt, definition A-4 
internal interrupt enable register, definition A-4 
interrupt, definition A-4 
interrupt acknowledge (IACK) definition A-4 
interrupt flag register 2-11 
interrupt programming, procedure 2-11 
interrupt service routine, INT2 2-13 
interrupt service routine (ISR), definition A-4 
interrupt vector table (IVT), definition A-4 
interrupt vector table pointer (IVTP), definition A-4 
interrupts 
communication port 8-3 
context switching 2-14 
context-switching 2-11 
DMA 7-4 
dual services, example 2-12 
example 3-2 
examples 2-11 
IVTP reset 2-12 
nesting 2-13 
NMI 2-11 
priorities 2-11 
programming 2-11 
service routines 2-11, 2-13 
software polling, example 2-11 
vector table 2-11 
inverse Fourier transform 6-24 
inverse lattice filter, example 6-18
inverse of floating point 3-12 
ISR. See interrupt service routine (ISR) 


IVTP 2-12 
See also interrupt vector table pointer 


IVTP register 2-11 


JTAG 11-14 


JTAG emulator 
buffered signals 11-9 
connection to target system 11-1 to 11-24 
no signal buffering 11-8 
pod interface 11-4


jumps 2-4 


LA0-LA30, definition. See A0-A30
LAJ instruction 2-4, 5-5 
lattice filter structure 6-17 


lattice filters 6-17, 6-18 
applications 6-17 
benchmarks 6-20 
forward 6-19 


LBb, LBUb instructions 3-4
LD0-LD31, definition. See D0-D31
LHw, LHUw instructions 3-4 


linker command file 1-6 
example 1-9 


literature vi, 10-3 
LMS algorithm 6-13 


local bus 4-3
control signals 4-11 


local memory interface. See memory interface 


local memory interface control register (LMICR), 
LSTRB ACTIVE field 4-9 


loop, delayed block repeat, example 2-19 
loop optimization, example 5-3 
loops 2-18 
single repeat 2-20 
LSB, definition A-5 
LWLct, LWRct instructions 3-4
machine cycle. See CPU cycle
mantissa, definition A-5 
maskable interrupt, definition A-5 


matrix vector multiplication, data-memory organization 6-21
MBct, MHct instructions 3-4 
medical applications vii, xiii
memory, object exchange, example 5-2 
memory device timing 4-6 
memory interface 4-12 
global 4-4 
local 4-4 
ready generation 4-11 
shared global 4-21 
strobes 4-7 
two banks 4-8 
wait states 4-11 
memory interface (local, global) 
RAM (zero wait states) 4-7 
shared bus 4-22 


memory interface control registers 4-12 
LSTRB ACTIVE field 4-8 
PAGESIZE field 4-8, 4-18 


memory interfacing, introduction 4-1 
memory map 4-4 
memory-mapped register, definition A-5 


message broadcasting 8-20 
communication ports 8-21 


MFLOPS, definition A-5 


microcomputer mode, definition. See microprocessor mode
microprocessor mode, definition. See microcomputer mode
military applications vii, xii
MIPS, definition A-5 
miss, definition A-5 
MPYI3 instruction 3-18 
MPYSHI3 instruction 3-18
MSB, definition A-5 
mu-law 
compression, expansion 6-2 
conversion, linear 6-2 
multimedia applications vii, xii
multiplication, matrix vector 6-21 
multiplier, definition A-5


networks 

distributed-memory 4-21 

parallel connectivity 8-18 
Newton-Raphson algorithm 3-12, 3-15 
NMI 2-13 

See also nonmaskable interrupt 
nomenclature 10-9 
nonmaskable interrupt (NMI), definition A-6 
normalization 3-15 


OCEMPTY interrupt, example 8-2 
OCRDY interrupt, example 8-2 
operations 

examples 3-1 

introduction 3-1 

logical instructions 3-2 
output enable (OE) controls 4-5 
output modes 

external count 11-18 

signal event 11-18 
output port 8-15 
overflow flag (OV) bit, definition A-6 


packing data example 3-4 
page, switching 4-18 
page switching, example 4-19 
PAL 11-19, 11-20, 11-22 
parallel instruction set, optimization use 5-5 
parallel processing 
’C4x to ’C4x 8-20
distributed memory 8-19 
shared and distributed memory 8-19 
shared bus 4-22 
shared memory 4-21, 8-19 
part numbers 
device 10-11 
tools 10-12 
part-order information 10-9 
PC. See program counter 


peripheral bus, definition A-6 
phone numbers, manufacturer 10-6 
pipeline, definition A-6 
pipelined linear array 8-18 
PLD equations 4-16 
polling method, communication port 
POP instruction 2-7, 2-14 
POPF instruction 2-7 
port driver circuit, diagram 8-16 
primary channel 7-14 
processor, delays 4-5 
processor initialization 1-6 
C language 1-9 
example 1-7 
introduction 1-1 
product vector 6-21 
program control 
instructions 2-1 
introduction 2-1 
program counter, definition A-6 
programming tips 7-2 
introduction 5-1 
protocol, bus 11-3 
pulldown resistor 8-5 
pullups 1-5, 8-5 
PUSH instruction 2-7, 2-14 
PUSHF instruction 2-7 


queues (stack) 2-9 


R/W. See read/write pin 

RAM, zero wait states 4-7 
RAMS 4-8 

RAMs 4-5 

RC. See repeat counter register 
RCPF instruction 3-9, 3-12 
readsynce 7-13 

read/write (R/W) pin, definition A-6 
ready control logic 4-14 

ready generation 4-11 

ready signals 4-12 
regional technology centers 10-5
register file, definition A-6 
registers 
optimization use 5-5 
repeat count (RC) 2-20 
stack pointer (SP) 2-7 
regular subroutine call, example 2-3 
repeat count register (RC) 2-20 
repeat counter register, definition A-6 
repeat mode, definition A-6 
repeat modes, block repeat, restrictions 2-19 
reset 
definition A-6 
multiprocessing 1-5 
rise/fall time 1-4 
signal generation 1-3 
vector locations 1-2
vector mapping 1-2
voltage 1-3 
reset circuit, diagram 1-3 
reset pin 
definition A-6 
voltage, diagram 1-4 
RETIcond instruction 2-13 
RETScond instruction 2-2 
ROMEN, definition A-6 
RPTB and RPTBD instructions 6-24 
optimization use 5-5 
RPTB instruction 2-18 
RPTBD instruction 2-18 
RPTS instruction 2-18 
example 2-18, 3-3 
optimization use 5-5 
RSQRF instruction 3-15, 3-16 
RTCs 10-5 
run/stop operation 11-8 
RUNB, debugger command 11-18, 11-19, 11-20, 
11-21, 11-22 
RUNB_ENABLE, input 11-20 


scan path linkers 11-14 
secondary JTAG scan chain to an SPL 11-15 
suggested timings 11-21 
usage 11-14 

scan paths, TBC emulation connections for JTAG 
scan paths 11-23 


seminars 10-5
serial resistors 8-5 
shared bus interface 4-22 
shared memory 4-21 
short floating point format, definition A-7 
short integer format, definition A-7 
short unsigned integer format, definition A-7 
signal descriptions 14-pin header 11-2
signal quality 8-5 
signals 
buffered 11-9 
buffering for emulator connections 11-8 to 11-11 
description 14-pin header 11-2
timing 11-5 
sign-extend, definition A-7 
single-access RAM (SARAM), definition A-7 
single-precision floating-point format, definition A-7 
single-precision integer format, definition A-7 
single-precision unsigned-integer format, definition A-7
slave devices 11-3 
slow devices, OR 4-12 
sockets 10-6 
325-pin ’C40, 304-pin ’C44 10-6
software development tools 
assembler/linker 10-2 
C compiler 10-2 
digital filter design package 10-2 
general 10-12 
linker 10-2 
simulator 10-2 


software interrupt, definition A-7 
software polling, interrupts, example 2-11 
software stack 2-2, 2-11 
speech/voice applications vii, x
split mode, definition A-7 
split mode (DMA) 7-5 
split-mode 7-13, 7-14 
square root, calculation 3-15 
ST. See status register 
stack 2-7 

definition A-7 
stack pointer 2-7 
stack pointer (SP), application 2-7 
stacks
growth 2-8 
high-to-low memory, diagram 2-9 
low-to-high memory, diagram 2-9 
user 2-8 
status register, definition A-7 
straight, unshrouded 14-pin 11-3 
STRBx SWW 4-12 


strobes 4-9 
wait states 4-7 
style (manual) iv 
SUBB instruction 3-17, 3-18 
SUBC instruction 3-9 
SUBI instruction 3-18 
subroutine 2-21 
subroutines 2-4, 2-14 
calls. See calls 
support tools 
development 10-10 
device 10-10 
support tools nomenclature 10-9 
symbols (used in manual) iv 
system configuration 4-2 
possible 4-2 
system configuration stack, diagram 2-8 
system initialization 1-3 
system stacks 2-7 
stack pointer 2-7 


target cable 11-12 

target system, connection to emulator 11-1 to 11-24

target-system clock 11-10 

TCK signal 11-2, 11-3, 11-5, 11-6, 11-11, 11-15, 
11-16, 11-23 

TDI signal 11-2, 11-3, 11-4, 11-5, 11-6, 11-7, 11-10, 
11-11, 11-16, 11-17 

TDO output 11-3 

TDO signal 11-3, 11-4, 11-6, 11-7, 11-17, 11-23 

technical assistance xiv, 10-3 

telecommunications applications vii, xii 

test bus controller 11-20, 11-23 


test clock 11-10 
diagram 11-10 


third-party support 10-3 
Timer, definition A-7 
Timer Period Register, definition A-7 
timing 
bank switching 4-20 
page switching 4-20 


timing calculations 11-6 to 11-7, 11-16 to 11-24 
TMS, signal 11-3 


TMS signal 11-2, 11-4, 11-5, 11-6, 11-7, 11-10,
11-11, 11-15, 11-16, 11-17, 11-23


TMS/TDI inputs 11-3 

TOIEEE instruction 3-19 

token forcer 8-15 

token forcer circuit, diagram 8-15 

tools, part numbers 10-12

tools nomenclature 10-9 

transfer function 6-9 

trap vector table (TVT), definition A-8 

trap vector table pointer (TVTP), definition A-8 
tree structures 8-18 


TRST signal 11-2, 11-5, 11-6, 11-11, 11-15, 11-16, 
11-24 


TSTB instruction 3-2 
TVTP. See trap vector table pointer 


twiddle factor 6-32 
fast Fourier transforms (FFT) 6-41 




unified mode, definition. See split mode 
unpacking data example 3-5 


wait state, definition A-8 

wait states 4-5, 4-11, 4-15 
consecutive reads, then write 4-6 
consecutive writes, then read 4-7
full-speed 4-5 
logic 4-14 
memory device timing. See memory device timing

wait-state generator, definition A-8 

workshops 10-5

write cycles, RAM requirements 4-6 


XDS510 emulator, JTAG cable. See emulation 


zero fill, definition A-8 
zero overhead subroutine call, example 2-5 


ZIF PGA socket 
handle-activated, diagram 10-8 
tool-activated, diagram 10-7 